We study the asymptotic properties of bridge estimators in sparse,
high-dimensional, linear regression models when the number of co- 
variates may increase to infinity with the sample size. We are par- 
ticularly interested in the use of bridge estimators to distinguish be- 
tween covariates whose coefficients are zero and covariates whose co- 
efficients are nonzero. We show that under appropriate conditions, 
bridge estimators correctly select covariates with nonzero coefficients 
with probability converging to one and that the estimators of nonzero 
coefficients have the same asymptotic distribution that they would 
have if the zero coefficients were known in advance. Thus, bridge es- 
timators have an oracle property in the sense of Fan and Li [J. Amer. 
Statist. Assoc. 96 (2001) 1348-1360] and Fan and Peng [Ann. Statist. 
32 (2004) 928-961]. In general, the oracle property holds only if the 
number of covariates is smaller than the sample size. However, under 
a partial orthogonality condition in which the covariates of the zero 
coefficients are uncorrelated or weakly correlated with the covariates 
of nonzero coefficients, we show that marginal bridge estimators can 
correctly distinguish between covariates with nonzero and zero coef- 
ficients with probability converging to one even when the number of 
covariates is greater than the sample size. 

1. Introduction. Consider the linear regression model 

$$Y_i = \beta_0 + x_i'\beta + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where $Y_i \in \mathbb{R}$ is a response variable, $x_i$ is a $p_n \times 1$ covariate vector and the
$\varepsilon_i$'s are i.i.d. random error terms. Without loss of generality, we assume
that $\beta_0 = 0$. This can be achieved by centering the response and covariates.
We are interested in estimating the vector of regression coefficients $\beta \in \mathbb{R}^{p_n}$
when $p_n$ may increase with $n$ and $\beta$ is sparse in the sense that many of its
elements are zero. We estimate $\beta$ by minimizing the penalized least squares
objective function

$$L_n(\beta) = \sum_{i=1}^{n} (Y_i - x_i'\beta)^2 + \lambda_n \sum_{j=1}^{p_n} |\beta_j|^{\gamma}, \qquad (1)$$

where $\lambda_n$ is a penalty parameter. For any given $\gamma > 0$, the value $\hat\beta_n$ that
minimizes (1) is called a bridge estimator [Frank and Friedman (1993) and
Fu (1998)]. The bridge estimator includes two important special cases. When
$\gamma = 2$, it is the familiar ridge estimator [Hoerl and Kennard (1970)]. When
$\gamma = 1$, it is the LASSO estimator [Tibshirani (1996)], which was introduced
as a variable selection and shrinkage method. When $0 < \gamma < 1$, some
components of the estimator minimizing (1) can be exactly zero if $\lambda_n$ is sufficiently
large [Knight and Fu (2000)]. Thus, the bridge estimator for $0 < \gamma < 1$
provides a way to combine variable selection and parameter estimation in a
single step. In this article we provide conditions under which the bridge
estimator for $0 < \gamma < 1$ can correctly distinguish between nonzero and zero
coefficients in sparse high-dimensional settings. We also give conditions
under which the estimator of the nonzero coefficients has the same asymptotic
distribution that it would have if the zero coefficients were known with
certainty.
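
As a small illustration of how objective (1) nests the ridge and LASSO criteria, the following sketch evaluates the penalized least squares criterion for several values of $\gamma$. The function name `bridge_objective` and the toy data are assumptions made for this example only; they do not come from the paper.

```python
import numpy as np

def bridge_objective(beta, X, y, lam, gamma):
    """Penalized least squares criterion: RSS + lam * sum_j |beta_j|**gamma."""
    resid = y - X @ beta
    return float(resid @ resid + lam * np.sum(np.abs(beta) ** gamma))

# Toy data (made up for illustration); beta fits y exactly, so the RSS term is 0.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, 2.0])

ridge_value = bridge_objective(beta, X, y, lam=0.5, gamma=2.0)   # 0.5 * (1 + 4)
lasso_value = bridge_objective(beta, X, y, lam=0.5, gamma=1.0)   # 0.5 * (1 + 2)
bridge_value = bridge_objective(beta, X, y, lam=0.5, gamma=0.5)  # 0.5 * (1 + sqrt(2))
```

Only the exponent of the penalty changes between the three calls; the residual sum of squares term is shared.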

Knight and Fu (2000) studied the asymptotic distributions of bridge
estimators when the number of covariates is finite. They showed that, for
$0 < \gamma < 1$, under appropriate regularity conditions, the limiting
distributions can have positive probability mass at 0 when the true value of the
parameter is zero. Their results provide a theoretical justification for the
use of bridge estimators to select variables (i.e., to distinguish between co- 
variates whose coefficients are exactly zero and covariates whose coefficients 
are nonzero). In addition to bridge estimators, other penalization methods 
have been proposed for the purpose of simultaneous variable selection and 
shrinkage estimation. Examples include the SCAD penalty [Fan (1997) and 
Fan and Li (2001)] and the Elastic-Net (Enet) penalty [Zou and Hastie 
(2005)]. For the SCAD penalty, Fan and Li (2001) studied asymptotic prop- 
erties of penalized likelihood methods when the number of parameters is 
finite. Fan and Peng (2004) considered the same problem when the number 
of parameters diverges. Under certain regularity conditions, they showed 
that there exist local maximizers of the penalized likelihood that have an 
oracle property. Here the oracle property means that the local maximizers 
can correctly select the nonzero coefficients with probability converging to 
one and that the estimators of the nonzero coefficients are asymptotically 
normal with the same means and covariances that they would have if the
zero coefficients were known in advance. Therefore, the local maximizers are
asymptotically as efficient as the ideal estimator assisted by an oracle who 
knows which coefficients are nonzero. 

Several other studies have investigated the properties of regression esti- 
mators when the number of covariates increases to infinity with sample size. 
See, for example, Huber (1981) and Portnoy (1984, 1985). Portnoy (1984, 
1985) provided conditions on the growth rate of $p_n$ that are sufficient for
consistency and asymptotic normality of a class of M-estimators of regression
parameters. However, Portnoy did not consider penalized regression or se- 
lection of variables in sparse models. Bair et al. (2006) proved consistency of 
supervised principal components analysis under a partial orthogonality con- 
dition, but they also did not consider penalized regression. There have been 
several other studies of large sample properties of high-dimensional prob- 
lems in settings related to but different from ours. Examples include Van 
der Laan and Bryan (2001), Bühlmann (2006), Fan, Peng and Huang (2005),
Huang, Wang and Zhang (2005), Huang and Zhang (2005) and Kosorok and 
Ma (2007). Fan and Li (2006) provide a review of statistical challenges in 
high-dimensional problems that arise in many important applications. 

We study the asymptotic properties of bridge estimators with $0 < \gamma < 1$
when the number of covariates $p_n$ may increase to infinity with $n$. We are
particularly interested in the use of bridge estimators to distinguish between
covariates with zero and nonzero coefficients. Our study extends the results
of Knight and Fu (2000) to infinite-dimensional parameter settings. We show
that for $0 < \gamma < 1$ the bridge estimators can correctly select covariates with
nonzero coefficients and that, under appropriate conditions on the growth
rates of $p_n$ and $\lambda_n$, the estimators of nonzero coefficients have the same
asymptotic distribution that they would have if the zero coefficients were
known in advance. Therefore, bridge estimators have the oracle property of
Fan and Li (2001) and Fan and Peng (2004). The permitted rate of growth
of $p_n$ depends on the penalty function form specified by $\gamma$. We require that
$p_n < n$; that is, the number of covariates must be smaller than the sample
size.

The condition that $p_n < n$ is needed for identification and consistent
estimation of the regression parameter. While this condition is often satisfied
in applications, there are important settings in which it is violated. For 
example, in studies of relationships between a phenotype and microarray 
gene expression profiles, the number of genes (covariates) is typically much 
greater than the sample size, although the number of genes that are actually 
related to the clinical outcome of interest is generally small. Often a goal 
of such studies is to find these genes. Without any further assumption on 
the covariate matrix, the regression parameter is in general not identifiable 
if $p_n > n$. However, if there is suitable structure in the covariate matrix, it
is possible to achieve consistent variable selection and estimation. A special
case is when the columns of the covariate matrix $X$ are mutually orthogonal.
Then each regression coefficient can be estimated by univariate regression.
But, in practice, mutual orthogonality is often too strong an assumption.
Furthermore, when $p_n > n$, mutual orthogonality of all covariates is not
possible, since the rank of $X$ is at most $n - 1$. We consider a partial
orthogonality condition in which the covariates with zero coefficients are uncorrelated
or only weakly correlated with the covariates with nonzero coefficients. We 
study a univariate version of the bridge estimator. We show that under the 
partial orthogonality condition and certain other conditions, the marginal 
bridge estimator can consistently distinguish between zero coefficients and 
nonzero coefficients even when the number of covariates is greater than n, 
although it does not yield consistent estimation of the parameters. After 
the covariates with nonzero coefficients are consistently selected, we can 
use any reasonable method to consistently estimate their coefficients if the 
number of nonzero coefficients is relatively small, as it is in sparse models. 
The partial orthogonality condition appears to be reasonable in microarray 
data analysis, where the genes that are correlated with the phenotype of 
interest may be in different functional pathways from the genes that are not 
related to the phenotype [Bair et al. (2006)]. Fan and Lv (2006) also studied 
univariate screening in high-dimensional regression problems and provided 
conditions under which it can be used to reduce the exponentially growing 
dimensionality of a model. 

The rest of this paper is organized as follows. In Section 2 we present 
asymptotic results for bridge estimators with $0 < \gamma < 1$ and $p_n \to \infty$ as
$n \to \infty$. We treat a general covariate matrix and a covariate matrix that
satisfies our partial orthogonality condition. In Section 3 we present results 
for marginal bridge estimators under the partial orthogonality condition. In 
Section 4 simulation studies are used to assess the finite sample performance 
of bridge estimators. Concluding remarks are given in Section 5. Proofs of 
the results stated in Sections 2 and 3 are given in Section 6. 

2. Asymptotic properties of bridge estimators. Let the true parameter
value be $\beta_{n0}$. The subscript $n$ indicates that $\beta_{n0}$ depends on $n$, but for
simplicity of notation, we will simply write $\beta_0$. Let $\beta_0 = (\beta_{10}', \beta_{20}')'$, where
$\beta_{10}$ is a $k_n \times 1$ vector and $\beta_{20}$ is an $m_n \times 1$ vector. Suppose that $\beta_{10} \neq 0$ and
$\beta_{20} = 0$, where $0$ is the vector with all components zero. So $k_n$ is the number
of nonzero coefficients and $m_n$ is the number of zero coefficients. We note
that it is unknown to us which coefficients are nonzero and which are zero.
We partition $\beta_0$ this way to facilitate the statement of the assumptions.

Let $x_i = (x_{i1}, \ldots, x_{ip_n})'$ be the $p_n \times 1$ vector of covariates of the $i$th
observation, $i = 1, \ldots, n$. We assume that the covariates are fixed. However,
we note that for random covariates, the results hold conditionally on the
covariates. We assume that the $Y_i$'s are centered and the covariates are
standardized, that is,

$$\sum_{i=1}^{n} Y_i = 0, \qquad \sum_{i=1}^{n} x_{ij} = 0 \quad \text{and} \quad \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2 = 1, \qquad j = 1, \ldots, p_n. \qquad (2)$$

We also write $x_i = (w_i', z_i')'$, where $w_i$ consists of the first $k_n$ covariates
(corresponding to the nonzero coefficients), and $z_i$ consists of the remaining
$m_n$ covariates (those with zero coefficients). Let $X_n$, $X_{1n}$ and $X_{2n}$ be the
matrices whose transposes are $X_n' = (x_1, \ldots, x_n)$, $X_{1n}' = (w_1, \ldots, w_n)$ and
$X_{2n}' = (z_1, \ldots, z_n)$, respectively. Let

$$\Sigma_n = n^{-1} X_n' X_n \quad \text{and} \quad \Sigma_{1n} = n^{-1} X_{1n}' X_{1n}.$$

Let $\rho_{1n}$ and $\rho_{2n}$ be the smallest and largest eigenvalues of $\Sigma_n$, and let $\tau_{1n}$
and $\tau_{2n}$ be the smallest and largest eigenvalues of $\Sigma_{1n}$, respectively.
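
The centering and scaling in (2), and the extreme eigenvalues of $\Sigma_n$, can be computed as in the sketch below. The helper names `standardize` and `extreme_eigenvalues` and the small data matrix are assumptions made for illustration; the code also assumes that no column of the covariate matrix is constant.

```python
import numpy as np

def standardize(X, y):
    """Center y; center each column of X and scale it so that mean(x_j**2) == 1."""
    y_c = y - y.mean()
    X_c = X - X.mean(axis=0)
    scale = np.sqrt(np.mean(X_c ** 2, axis=0))  # assumes no constant column
    return X_c / scale, y_c

def extreme_eigenvalues(X):
    """Smallest and largest eigenvalues of Sigma_n = X'X / n."""
    n = X.shape[0]
    eig = np.linalg.eigvalsh(X.T @ X / n)  # eigenvalues in ascending order
    return eig[0], eig[-1]

# Toy data (made up for illustration).
X_raw = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 5.0]])
y_raw = np.array([1.0, 3.0, 2.0, 6.0])

X_std, y_std = standardize(X_raw, y_raw)
rho_1n, rho_2n = extreme_eigenvalues(X_std)
```

After standardization the diagonal of $\Sigma_n$ equals one, so the eigenvalues sum to $p_n$; a value of `rho_1n` close to zero signals near collinearity.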

We now state the conditions for consistency and oracle efficiency of bridge 
estimators with general covariate matrices. 

(A1) $\varepsilon_1, \varepsilon_2, \ldots$ are independent and identically distributed random
variables with mean zero and variance $\sigma^2$, where $0 < \sigma^2 < \infty$.
(A2) (a) $\rho_{1n} > 0$ for all $n$; (b) $(p_n + \lambda_n k_n)(n\rho_{1n})^{-1} \to 0$.
(A3) (a) $\lambda_n (k_n/n)^{1/2} \to 0$; (b) $\lambda_n n^{-\gamma/2} (\rho_{1n}/\sqrt{p_n})^{2-\gamma} \to \infty$.
(A4) There exist constants $0 < b_0 < b_1 < \infty$ such that

$$b_0 \le \min\{|\beta_{1j}|, 1 \le j \le k_n\} \le \max\{|\beta_{1j}|, 1 \le j \le k_n\} \le b_1.$$

(A5) (a) There exist constants $0 < \tau_1 < \tau_2 < \infty$ such that $\tau_1 \le \tau_{1n} \le \tau_{2n} \le \tau_2$
for all $n$; (b)

$$n^{-1/2} \max_{1 \le i \le n} w_i' w_i \to 0.$$

Condition (A1) is standard in linear regression models. Condition (A2)(a)
implies that the matrix $\Sigma_n$ is nonsingular for each $n$, but it permits $\rho_{1n} \to 0$
as $n \to \infty$. As we will see in Theorem 1, $\rho_{1n}$ affects the rate of convergence
of bridge estimators. Condition (A2)(b) is used in the consistency proof.
Condition (A3) is needed in the proofs of the rate of convergence, oracle
property and asymptotic normality. To get a better sense of this condition,
suppose that $0 < c_1 < \rho_{1n} \le \rho_{2n} < c_2 < \infty$ for some constants $c_1$ and $c_2$ and
for all $n$ and that the number of nonzero coefficients is finite. Then (A3)
simplifies to

(A3)* (a) $\lambda_n n^{-1/2} \to 0$; (b) $\lambda_n n^{-\gamma/2} p_n^{-(2-\gamma)/2} \to \infty$.

Condition (A3)*(a) states that the penalty parameter $\lambda_n$ must always
be $o(n^{1/2})$. Suppose that $\lambda_n = n^{(1-\delta)/2}$ for a small $\delta > 0$. Then (A3)*(b)
requires that $p_n^{2-\gamma}/n^{1-\delta-\gamma} \to 0$. So the smaller the $\gamma$, the larger $p_n$ is
allowed. This condition excludes $\gamma = 1$, which corresponds to the LASSO
estimator. If $p_n$ is finite, then this condition is the same as that assumed by
Knight and Fu [(2000), page 1361]. Condition (A4) assumes that the nonzero
coefficients are uniformly bounded away from zero and infinity. Condition
(A5)(a) assumes that the matrix $\Sigma_{1n}$ is strictly positive definite. In sparse
problems, $k_n$ is small relative to $n$, so this assumption is reasonable in such
problems. Condition (A5)(b) is needed in the proof of asymptotic normality
of the estimators of nonzero coefficients. Under condition (A3)(a), this
condition is satisfied if all the covariates corresponding to the nonzero
coefficients are bounded by a constant $C$. This is because, under (A3)(a),
$n^{-1/2} \max_{1 \le i \le n} w_i' w_i \le n^{-1/2} k_n C \to 0$.
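
To make the growth restriction in (A3)* concrete, the following short calculation spells out the step from (A3)*(b) to the stated rate. It uses only the choice $\lambda_n = n^{(1-\delta)/2}$ made in the discussion above; the final rewriting as an order bound on $p_n$ is added here for illustration.

```latex
\begin{align*}
\lambda_n n^{-\gamma/2} p_n^{-(2-\gamma)/2}
  &= n^{(1-\delta)/2}\, n^{-\gamma/2}\, p_n^{-(2-\gamma)/2}
   = \left( \frac{n^{1-\delta-\gamma}}{p_n^{2-\gamma}} \right)^{1/2} \to \infty \\
  &\iff \frac{p_n^{2-\gamma}}{n^{1-\delta-\gamma}} \to 0
   \iff p_n = o\!\left( n^{(1-\delta-\gamma)/(2-\gamma)} \right).
\end{align*}
```

The exponent $(1-\delta-\gamma)/(2-\gamma)$ is decreasing in $\gamma$ on $(0, 1)$, since its derivative with respect to $\gamma$ is $-(1+\delta)/(2-\gamma)^2 < 0$. For instance, with $\gamma = 1/2$ and $\delta$ close to zero the exponent is close to $1/3$, whereas it becomes negative as $\gamma$ approaches 1.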

In the following, the $L_2$ norm of any vector $u \in \mathbb{R}^{p_n}$ is denoted by $\|u\|$;
that is, $\|u\| = [\sum_{j=1}^{p_n} u_j^2]^{1/2}$.

Theorem 1 (Consistency). Let $\hat\beta_n$ denote the minimizer of (1). Suppose
that $\gamma > 0$ and that conditions (A1), (A2), (A3)(a) and (A4) hold.
Let $h_n = \rho_{1n}^{-1}(p_n/n)^{1/2}$ and $h_n' = [(p_n + \lambda_n k_n)/(n\rho_{1n})]^{1/2}$. Then
$\|\hat\beta_n - \beta_0\| = O_p(\min\{h_n, h_n'\})$.

We note that $\rho_{1n}^{1/2}$ and $\rho_{1n}$ appear in the denominators of $h_n'$ and $h_n$,
respectively. Therefore, $h_n$ may not converge to zero faster than $h_n'$ if
$\rho_{1n} \to 0$. If $\rho_{1n} > \rho_1 > 0$ for all $n$, Theorem 1 yields the rate of convergence
$O_p(h_n) = O_p((p_n/n)^{1/2})$. If $p_n$ is finite and $\rho_{1n} > \rho_1 > 0$ for all $n$, then the
rate of convergence is the familiar $n^{-1/2}$. However, if $\rho_{1n} \to 0$, the rate of
convergence will be slower than $n^{-1/2}$.

This result is related to the consistency result of Portnoy (1984). If
$\rho_{1n} > \rho_1 > 0$ for all $n$, which Portnoy assumed, then the rate of convergence in
Theorem 1 is the same as that in Theorem 3.2 of Portnoy (1984). Here, however,
we consider penalized least squares estimators, whereas Portnoy considered
general M-estimators in a linear regression model without penalty. In
addition, Theorem 1 is concerned with the minimizer of the objective function
(1). In comparison, Theorem 3.2 of Portnoy shows that there exists a root
of an M-estimating equation with convergence rate $O_p((p_n/n)^{1/2})$.

Theorem 2 (Oracle property). Let $\hat\beta_n = (\hat\beta_{1n}', \hat\beta_{2n}')'$, where $\hat\beta_{1n}$ and $\hat\beta_{2n}$
are estimators of $\beta_{10}$ and $\beta_{20}$, respectively. Suppose that $0 < \gamma < 1$ and that
conditions (A1) to (A5) are satisfied. We have the following:

(i) $\hat\beta_{2n} = 0$ with probability converging to 1.
(ii) Let $s_n^2 = \sigma^2 \alpha_n' \Sigma_{1n}^{-1} \alpha_n$ for any $k_n \times 1$ vector $\alpha_n$ satisfying $\|\alpha_n\|_2 \le 1$.
Then

$$n^{1/2} s_n^{-1} \alpha_n'(\hat\beta_{1n} - \beta_{10}) = n^{-1/2} s_n^{-1} \sum_{i=1}^{n} \varepsilon_i \alpha_n' \Sigma_{1n}^{-1} w_i + o_p(1) \to_D N(0, 1), \qquad (3)$$

where $o_p(1)$ is a term that converges to zero in probability uniformly with
respect to $\alpha_n$.



Theorem 2 states that the estimators of the zero coefficients are exactly 
zero with high probability when n is large and that the estimators of the 
nonzero parameters have the same asymptotic distribution that they would 
have if the zero coefficients were known. This result is stated in a way similar 
to Theorem 2 of Fan and Peng (2004). Fan and Peng considered maximum 
penalized likelihood estimation. Their results are concerned with local max- 
imizers of the penalized likelihood. These results do not imply existence 
of an estimator with the properties of the local maximizer without auxil- 
iary information about the true parameter value that enables one to choose 
the localization neighborhood. In contrast, our Theorem 2 is for the global 
minimizer of the penalized least squares objective function, which is a
feasible estimator. In addition, Fan and Peng (2004) require the number
of parameters, $p_n$, to satisfy $p_n^5/n \to 0$, which is more restrictive than our
assumption for the linear regression model.

Let $\hat\beta_{1nj}$ and $\beta_{10j}$ be the $j$th components of $\hat\beta_{1n}$ and $\beta_{10}$, respectively.
Set $\alpha_n = e_j$ in Theorem 2, where $e_j$ is the unit vector whose only nonzero
element is the $j$th element, and let $s_{nj}^2 = \sigma^2 e_j' \Sigma_{1n}^{-1} e_j$. Then we have
$n^{1/2} s_{nj}^{-1}(\hat\beta_{1nj} - \beta_{10j}) \to_D N(0, 1)$. Thus, Theorem 2 provides asymptotic
justification for the following steps to compute an approximate standard error
of $\hat\beta_{1nj}$: (i) compute the bridge estimator for a given $\gamma$; (ii) exclude the
covariates whose estimates are zero; (iii) compute a consistent estimator
$\hat\sigma^2$ of $\sigma^2$ based on the residual sum of squares; (iv) compute
$\hat s_{nj} = \hat\sigma (e_j' \Sigma_{1n}^{-1} e_j)^{1/2}$, which, after scaling by $n^{-1/2}$, gives
an approximate standard error of $\hat\beta_{1nj}$.
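
Steps (ii) to (iv) above can be sketched as follows. The function name `approx_standard_errors`, the divisor $n - k$ in the variance estimate and the toy data are assumptions made for this illustration; the ordinary least squares fit merely stands in for a bridge fit whose nonzero estimates have already been identified in step (i).

```python
import numpy as np

def approx_standard_errors(W, y, beta_selected):
    """Approximate standard errors for the coefficients of the selected covariates.

    W holds only the covariates with nonzero estimates (step ii) and
    beta_selected holds their estimated coefficients (step i).
    """
    n, k = W.shape
    resid = y - W @ beta_selected
    sigma2_hat = resid @ resid / (n - k)      # step (iii); the n - k divisor is an assumption
    sigma_1n = W.T @ W / n                    # Sigma_1n built from the selected covariates
    s_hat = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(sigma_1n)))  # step (iv)
    return s_hat / np.sqrt(n)                 # scale by n**(-1/2)

# Toy data (made up for illustration).
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
y = np.array([1.1, 1.9, 3.2, -0.8])
beta_hat, *_ = np.linalg.lstsq(W, y, rcond=None)  # stand-in for the nonzero bridge estimates

se = approx_standard_errors(W, y, beta_hat)
```

With this scaling the result coincides with the familiar formula $\hat\sigma\,[(W'W)^{-1}_{jj}]^{1/2}$ for least squares on the selected covariates.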

Theorem 1 holds for any $\gamma > 0$. However, Theorem 2 assumes that $\gamma$ is
strictly less than 1, which excludes the LASSO estimator.

3. Asymptotic properties of marginal bridge estimators under partial
orthogonality condition. Although the results in Section 2 allow the number
of covariates $p_n \to \infty$ as the sample size $n \to \infty$, they require that $p_n < n$.
In this section we show that, under a partial orthogonality condition on the
covariate matrix, we can consistently identify the covariates with zero
coefficients using a marginal bridge objective function, even when the number of
covariates increases almost exponentially with $n$. The precise statement of
partial orthogonality is given in condition (B2) below. The marginal bridge
estimator is computationally simple and can be used to screen out the
covariates with zero coefficients, thereby reducing the exponentially growing
dimension of the model to a more manageable one. The nonzero coefficients
can be estimated in a second step, as is explained later in this section.
The marginal bridge objective function is

$$U_n(\beta) = \sum_{j=1}^{p_n} \sum_{i=1}^{n} (Y_i - x_{ij}\beta_j)^2 + \lambda_n \sum_{j=1}^{p_n} |\beta_j|^{\gamma}. \qquad (4)$$

Let $\tilde\beta_n$ be the value that minimizes $U_n$. Write $\tilde\beta_n = (\tilde\beta_{n1}', \tilde\beta_{n2}')'$ according to
the partition $\beta_0 = (\beta_{10}', \beta_{20}')'$. Let $K_n = \{1, \ldots, k_n\}$ and $J_n = \{k_n + 1, \ldots, p_n\}$
be the sets of indices of nonzero and zero coefficients, respectively. Let

$$\xi_{nj} = n^{-1} E\Bigl( \sum_{i=1}^{n} Y_i x_{ij} \Bigr) = n^{-1} \sum_{i=1}^{n} (w_i'\beta_{10}) x_{ij}, \qquad (5)$$

which is the "covariance" between the $j$th covariate and the response
variable. With the centering and standardization given in (2), $\xi_{nj}/\sigma$ is the
correlation coefficient.

(B1) (a) $\varepsilon_1, \varepsilon_2, \ldots$ are independent and identically distributed random
variables with mean zero and variance $\sigma^2$, where $0 < \sigma^2 < \infty$; (b) the $\varepsilon_i$'s are
sub-Gaussian, that is, their tail probabilities satisfy $P(|\varepsilon_i| > x) \le K \exp(-Cx^2)$,
$i = 1, 2, \ldots$, for constants $C$ and $K$.

(B2) (Partial orthogonality) (a) There exists a constant $c_0 > 0$ such that

$$\Bigl| n^{-1/2} \sum_{i=1}^{n} x_{ij} x_{ik} \Bigr| \le c_0, \qquad j \in J_n,\ k \in K_n,$$

for all $n$ sufficiently large. (b) There exists a constant $\xi_0 > 0$ such that
$\min_{k \in K_n} |\xi_{nk}| > \xi_0 > 0$.

(B3) (a) $\lambda_n/n \to 0$ and $\lambda_n n^{-\gamma/2} k_n^{\gamma-2} \to \infty$; (b) $\log(m_n) = o(1)(\lambda_n n^{-\gamma/2})^{2/(2-\gamma)}$.

(B4) There exists a constant $0 < b_1 < \infty$ such that $\max_{k \in K_n} |\beta_{1k}| \le b_1$.

Condition (B1)(b) assumes that the tails of the error distribution behave
like normal tails. Thus, it excludes heavy-tailed distributions. Condition
(B2)(a) assumes that the covariates of the nonzero coefficients and the
covariates of the zero coefficients are only weakly correlated. Condition (B2)(b)
requires that the correlations between the covariates with nonzero
coefficients and the dependent variable are bounded away from zero. Condition
(B3)(a) restricts the penalty parameter $\lambda_n$ and the number of nonzero
coefficients $k_n$. For $\lambda_n$, we must have $\lambda_n = o(n)$. For such a $\lambda_n$,
$\lambda_n n^{-\gamma/2} k_n^{\gamma-2} = o(1) n^{(2-\gamma)/2} k_n^{\gamma-2} = o(1)(n^{1/2}/k_n)^{2-\gamma}$.
Thus, $k_n$ must satisfy $k_n/n^{1/2} = o(1)$.
Condition (B3)(b) restricts the number of zero coefficients $m_n$. To get a sense of how large
$m_n$ can be, we note that $\lambda_n$ can be as large as $\lambda_n = o(n)$. Thus,
$\log(m_n) = o(1)(n^{(2-\gamma)/2})^{2/(2-\gamma)} = o(1) n$. So $m_n$ can be of the order $\exp(o(n))$. This
certainly permits $m_n/n \to \infty$ and, hence, $p_n/n \to \infty$ as $n \to \infty$. Similar
phenomena occur in Van der Laan and Bryan (2001) and Kosorok and Ma
(2007) for uniformly consistent marginal estimators under different "large p,
small n" data settings. On the other hand, the number of nonzero coefficients
$k_n$ still must be smaller than $n$.
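
In a simulation, where the set $K_n$ of nonzero coefficients is known, the quantities appearing in (B2) can be inspected directly. The sketch below is an illustration under that assumption; the helper names `partial_orthogonality_stat` and `xi` and the small orthogonal design are hypothetical and not part of the paper.

```python
import numpy as np

def partial_orthogonality_stat(X, nonzero_idx):
    """max over j in J_n, k in K_n of |n**(-1/2) * sum_i x_ij * x_ik|."""
    n, p = X.shape
    K = np.asarray(nonzero_idx)
    J = np.setdiff1d(np.arange(p), K)         # indices of the zero coefficients
    cross = X[:, J].T @ X[:, K] / np.sqrt(n)
    return float(np.max(np.abs(cross)))

def xi(X, W, beta_10):
    """xi_nj = n**(-1) * sum_i (w_i' beta_10) * x_ij for every column j of X."""
    n = X.shape[0]
    return X.T @ (W @ beta_10) / n

# Toy design (made up): centered, standardized and mutually orthogonal columns.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, -1.0],
              [-1.0, 1.0, -1.0],
              [-1.0, -1.0, 1.0]])
K = [0]                                       # only the first coefficient is nonzero
stat = partial_orthogonality_stat(X, K)       # left side of (B2)(a), maximized over j and k
xi_vals = xi(X, X[:, K], np.array([2.0]))     # xi_nj for j = 1, 2, 3
```

Here the columns are exactly orthogonal, so the statistic is zero and only the first covariate has a nonzero $\xi_{nj}$.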

Theorem 3. Suppose that conditions (B1) to (B4) hold and that
$0 < \gamma < 1$. Then

$$P(\tilde\beta_{n2} = 0) \to 1 \quad \text{and} \quad P(\tilde\beta_{n1k} \neq 0,\ k \in K_n) \to 1.$$

This theorem says that marginal bridge estimators can correctly distin- 
guish between covariates with nonzero and zero coefficients with probability 
converging to one. However, the estimators of the nonzero coefficients are 
not consistent. To obtain consistent estimators, we use a two-step approach. 
First, we use the marginal bridge estimator to select the covariates with 
nonzero coefficients. Then we estimate the regression model with the se- 
lected covariates. In the second step, any reasonable regression method can 
be used. The choice of method is likely to depend on the characteristics of 
the data at hand, including the number of nonzero coefficients selected in 
the first step, the properties of the design matrix and the shape of the
distribution of the $\varepsilon_i$'s. A two-step approach different from the one proposed here
was also used by Bair et al. (2006) in their approach for supervised prin- 
cipal component analysis. In a recent paper Zhao and Yu (2006) provided 
an irrepresentable condition under which the LASSO is variable selection 
consistent. It would be interesting to study the implications of the irrepre- 
sentable condition in the context of bridge regression. 
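
The second step of the two-step approach described above can be sketched as follows, taking the first-step selection as given. Ordinary least squares is used here as one "reasonable regression method"; the function name `second_step_ols` and the toy data are assumptions made for illustration.

```python
import numpy as np

def second_step_ols(X, y, selected):
    """Refit by ordinary least squares on the covariates kept in step one."""
    selected = np.asarray(selected, dtype=bool)
    beta = np.zeros(X.shape[1])
    if selected.any():
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        beta[selected] = coef                 # unselected coefficients stay at zero
    return beta

# Toy design (made up) with a noise-free response, so the refit is exact.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, -1.0],
              [-1.0, 1.0, -1.0],
              [-1.0, -1.0, 1.0]])
y = X @ np.array([1.5, 0.0, -2.0])
selected = [True, False, True]                # assumed outcome of the marginal bridge step

beta_two_step = second_step_ols(X, y, selected)
```

Any other estimator could replace the least squares call, for example a bridge fit restricted to the selected columns.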

We now consider the use of the bridge objective function for second-stage
estimation of $\beta_{10}$, the vector of nonzero coefficients. Since the zero
coefficients are correctly identified with probability converging to one, we
can assume that only the covariates with nonzero coefficients are included
in the model in the asymptotic analysis of the second step estimation. Let
$\hat\beta_{1n}^*$ be the estimator in this step. Then, for the purpose of deriving its
asymptotic distribution, it can be defined as the value that minimizes

$$U_n^*(\beta_1) = \sum_{i=1}^{n} (Y_i - w_i'\beta_1)^2 + \lambda_n^* \sum_{j=1}^{k_n} |\beta_{1j}|^{\gamma}, \qquad (6)$$

where $\beta_1 = (\beta_{11}, \ldots, \beta_{1k_n})'$. In addition to conditions (B1) to (B4), we
assume the following:






J. HUANG, J. L. HOROWITZ AND S. MA 



(B5) (a) There exists a constant $\tau_1 > 0$ such that $\tau_{1n} \ge \tau_1$ for all $n$
sufficiently large;
(b) The covariates of nonzero coefficients satisfy $n^{-1/2} \max_{1 \le i \le n} w_i' w_i \to 0$.

(B6) (a) $k_n(1 + \lambda_n^*)/n \to 0$; (b) $\lambda_n^* (k_n/n)^{1/2} \to 0$.

These two conditions are needed for the asymptotic normality of $\hat\beta_{1n}^*$.
Compared to condition (A5)(a), (B5)(a) assumes that the smallest eigenvalue
of $\Sigma_{1n}$ is bounded away from zero, but does not assume that its largest
eigenvalue is bounded. Condition (B5)(b) is the same as (A5)(b). In
condition (B6), we can set $\lambda_n^* = 0$ for all $n$. Then $\hat\beta_{1n}^*$ is the OLS estimator.
Thus, Theorem 4 below is applicable to the OLS estimator. When $\lambda_n^*$ is zero,
(B6)(a) becomes $k_n/n \to 0$ and (B6)(b) is satisfied for any value of $k_n$.
Condition (B5)(b) also restricts $k_n$ implicitly. For example, if the covariates
in $w_i$ are bounded below by a constant $w_0 > 0$, then $w_i' w_i \ge k_n w_0^2$. So for
(B5)(b) to hold, we must have $k_n n^{-1/2} \to 0$.

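Theorem 4. Suppose that conditions (B1) to (B6) hold and that
$0 < \gamma < 1$. Let $s_n^2 = \sigma^2 \alpha_n' \Sigma_{1n}^{-1} \alpha_n$ for any $k_n \times 1$ vector $\alpha_n$ satisfying $\|\alpha_n\|_2 \le 1$.
Then

$$n^{1/2} s_n^{-1} \alpha_n'(\hat\beta_{1n}^* - \beta_{10}) = n^{-1/2} s_n^{-1} \sum_{i=1}^{n} \varepsilon_i \alpha_n' \Sigma_{1n}^{-1} w_i + o_p(1) \to_D N(0, 1), \qquad (7)$$

where $o_p(1)$ is a term that converges to zero in probability uniformly with
respect to $\alpha_n$.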

4. Numerical studies. In this section we use simulation to evaluate the 
finite sample performance of bridge estimators. 

4.1. Computation of bridge estimators. The penalized objective function
(1) is not differentiable when $\beta$ has zero components. This singularity causes
standard gradient-based methods to fail. Motivated by the method of Fan
and Li (2001) and Hunter and Li (2005), we approximate the bridge penalty
by a function that has finite gradient at zero. Specifically, we approximate
the bridge penalty function by $\sum_{j=1}^{p} \int_{-\infty}^{|\beta_j|} [\mathrm{sgn}(u)/(|u|^{1/2} + \eta)]\,du$ for a small
$\eta > 0$. We note that this function and its gradient converge to the bridge penalty
and its gradient as $\eta \to 0$, respectively.
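
The effect of the smoothing constant $\eta$ can be checked numerically. The sketch below implements the smoothed penalty gradient for $\gamma = 1/2$, including the factor $\lambda/2$ that appears in the algorithm's penalty gradient, and compares it with the exact gradient away from zero. The function name `bridge_penalty_grad` and the test values are assumptions made for illustration.

```python
import numpy as np

def bridge_penalty_grad(beta, lam, eta):
    """Smoothed gradient of lam * sum_j |beta_j|**0.5; finite even at beta_j == 0."""
    return 0.5 * lam * np.sign(beta) / (np.sqrt(np.abs(beta)) + eta)

beta = np.array([0.0, 0.25, -4.0])
# Exact gradient of |b|**0.5 at the nonzero entries: 0.5 * sign(b) / sqrt(|b|).
exact = 0.5 * np.sign(beta[1:]) / np.sqrt(np.abs(beta[1:]))

errors = []
for eta in (1e-1, 1e-3, 1e-6):
    approx = bridge_penalty_grad(beta, lam=1.0, eta=eta)
    errors.append(float(np.max(np.abs(approx[1:] - exact))))
```

The entries of `errors` shrink as $\eta$ decreases, while the component at $\beta_j = 0$ stays finite for every $\eta > 0$.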

Let $p = p_n$ be the number of covariates. Let $\beta^{(m)}$ be the value of the
$m$th iteration from the optimization algorithm, $m = 0, 1, \ldots$. Let $\tau$ be a
prespecified convergence criterion. We set $\tau = 10^{-4}$ in our numerical
studies. We conclude convergence if $\max_{1 \le j \le p} |\beta_j^{(m)} - \beta_j^{(m-1)}| < \tau$, and conclude
$\hat\beta_j = 0$ if $|\beta_j^{(m)}| < \tau$. Denote $y_n = (Y_1, \ldots, Y_n)'$.

Initialize $\beta^{(0)} = 0$ and $\eta = \tau$. For $m = 0, 1, \ldots$:

1. Compute the gradient of the sum of the squares $g_1 = X_n'(y_n - X_n\beta^{(m)})$
and the approximate gradient of the penalty

$$g_2(\eta) = \tfrac{1}{2}\lambda \bigl( \mathrm{sgn}(\beta_1^{(m)})/(|\beta_1^{(m)}|^{1/2} + \eta), \ldots, \mathrm{sgn}(\beta_p^{(m)})/(|\beta_p^{(m)}|^{1/2} + \eta) \bigr)'.$$

Here $g_1$ and $g_2$ are $p \times 1$ vectors, with $j$th components $g_{1j}$ and $g_{2j}$,
respectively. Note we use the notation $g_2(\eta)$ to emphasize that the approximate
gradient depends on $\eta$.

2. Compute the gradient $g$ whose $j$th component, $g_j$, is defined as

if $|\beta_j^{(m)}| > \tau$, $g_j = g_{1j} + g_{2j}(\eta)$;

if $|\beta_j^{(m)}| \le \tau$, $g_j = g_{1j} + g_{2j}(\eta^*)$,

where $\eta^* = \arg\max_{0 < |\beta_j^{(m)}| \le \tau} |g_{1j}/g_{2j}(\eta)|$. In this way, we guarantee that,
for the zero estimates, the corresponding components in $g_2$ dominate the
corresponding components in $g_1$. Update $\eta = \eta^*$.

3. Re-scale $g = g/\max_j |g_j|$, such that its maximum component (in terms
of absolute value) is less than or equal to 1. This step and the previous
one guarantee that the increment in the components of $\beta$ is less than $\tau$,
the convergence criterion.

4. Update $\beta^{(m+1)} = \beta^{(m)} + \Delta \times g$, where $\Delta$ is the increment in this iterative
process. In our implementation we used $\Delta = 2 \times 10^{-3}$.

5. Replace $m$ by $m + 1$ and repeat steps 1-5 until convergence.

Extensive simulation studies show that estimates obtained using this al- 
gorithm are well behaved and convergence is achieved under all simulated 
settings. 

4.2. Computation of marginal bridge estimators. For a given penalty
parameter $\lambda_n$, minimization of the marginal objective function $U_n$ defined in
(4) amounts to solving a series of univariate minimization problems.
Furthermore, since marginal bridge estimators are used only for variable selection,
we do not need to solve the minimization problem. We only need to
determine which coefficients are zero and which are not.

The objective function of each univariate minimization problem can be
written in the form

$$g(u) = u^2 - 2au + \lambda |u|^{\gamma},$$

where $|a| > 0$. By Lemma A of Knight and Fu (2000), $\arg\min(g) = 0$ if and
only if

$$\lambda > \frac{2}{2 - \gamma} \left( \frac{2(1 - \gamma)}{2 - \gamma} \right)^{1 - \gamma} |a|^{2 - \gamma}.$$

Therefore, computation for variable selection based on marginal bridge
estimators can be done very quickly.
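
The selection rule above needs no optimization, as the following sketch shows. It assumes the columns of the covariate matrix are standardized as in (2), so that each summand of $U_n$ divided by $n$ has the form $u^2 - 2a_j u + (\lambda_n/n)|u|^{\gamma}$ plus a constant, with $a_j = n^{-1}\sum_i x_{ij} Y_i$. The function names and the toy data are assumptions made for illustration.

```python
import numpy as np

def lemma_a_threshold(a, gamma):
    """Penalty level above which argmin of u**2 - 2*a*u + lam*|u|**gamma is 0."""
    return ((2.0 / (2.0 - gamma))
            * (2.0 * (1.0 - gamma) / (2.0 - gamma)) ** (1.0 - gamma)
            * np.abs(a) ** (2.0 - gamma))

def marginal_bridge_selected(X, y, lam_n, gamma):
    """Boolean mask: True where the marginal bridge estimate is nonzero.

    Assumes mean(x_j**2) == 1 for every column, so the effective penalty in
    the univariate problem is lam_n / n.
    """
    n = X.shape[0]
    a = X.T @ y / n
    return lam_n / n <= lemma_a_threshold(a, gamma)

# Toy design (made up): standardized, mutually orthogonal columns.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, -1.0],
              [-1.0, 1.0, -1.0],
              [-1.0, -1.0, 1.0]])
y = 2.0 * X[:, 0] + 0.1 * X[:, 2]             # strong signal in column 1, faint one in column 3
selected = marginal_bridge_selected(X, y, lam_n=2.0, gamma=0.5)
```

Each covariate costs one inner product and one comparison, which is what makes screening feasible when $p_n$ is far larger than $n$.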

4.3. Simulation studies. This section describes simulation studies that
are used to evaluate the finite sample performance of the bridge estima- 
tor. We investigate three features: (i) variable selection; (ii) prediction; and 
(iii) estimation. For (i), we measure variable selection performance by the 
frequency of correctly identifying zero and nonzero coefficients in repeated 
simulations. For (ii), we measure prediction performance using prediction 
mean square errors (PMSE), which are calculated from the fitted values 
based on the training data and the observed responses in an independent 
testing data not used in model fitting. For (iii), we measure estimation per- 
formance using the estimation mean square errors (EMSE) of the estimator, 
which are calculated from the estimated and true values of the parameters. 

For comparison of prediction performance, we compare the PMSE of the 
bridge estimator to those of ordinary least squares (OLS) when 
applicable, ridge regression (RR), LASSO and Enet estimators. We
assess the oracle property based on the variable selection results and the
EMSE. For the bridge estimator, we set $\gamma = 1/2$. The RR, LASSO and
elastic-net estimators are computed using the publicly available R packages
(http://www.r-project.org). The bridge estimator is computed using the
algorithm described in Section 4.1. The simulation scheme is close to the 
one in Zou and Hastie (2005), but differs in that the covariates are fixed 
instead of random. 

We simulate data from the model 

$$y = x'\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2).$$

Six examples are considered, representing six different and commonly
encountered scenarios. In each example the covariate vector $x$ is generated
from a multivariate normal distribution whose marginal distributions are
standard $N(0, 1)$ and whose covariance matrix is given in the description
below. The value of $x$ is generated once and then kept fixed. Replications
are obtained by simulating the values of $\varepsilon$ from $N(0, \sigma^2)$ and then setting
$y = x'\beta + \varepsilon$ for the fixed covariate value $x$. Summary statistics are computed
based on 500 replications. We consider six simulation models.

Example 1. $p = 30$ and $\sigma = 1.5$. The pairwise correlation between the
$i$th and the $j$th components of $x$ is $r^{|i-j|}$ with $r = 0.5$. Components 1-5
of $\beta$ are 2.5; components 6-10 are 1.5; components 11-15 are 0.5 and the
rest are zero. So there are 15 nonzero covariate effects: five large effects, five
moderate effects and five small effects.
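
A design of the kind described in Example 1 could be generated as in the sketch below. The sample size of 100 matches the training set size reported later in this section; the seed, the function names and the use of NumPy's random generator are assumptions made for illustration and are not the authors' code.

```python
import numpy as np

def ar1_correlation(p, r):
    """p x p matrix whose (i, j) entry is r**|i - j|."""
    idx = np.arange(p)
    return r ** np.abs(idx[:, None] - idx[None, :])

def example1_setup(n=100, p=30, r=0.5, sigma=1.5, seed=0):
    rng = np.random.default_rng(seed)         # the seed is an arbitrary choice
    cov = ar1_correlation(p, r)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)  # drawn once, then held fixed
    beta = np.concatenate([np.full(5, 2.5), np.full(5, 1.5),
                           np.full(5, 0.5), np.zeros(p - 15)])
    return X, beta, sigma, rng

def draw_response(X, beta, sigma, rng):
    """One replication: fresh errors, same covariates."""
    return X @ beta + sigma * rng.standard_normal(X.shape[0])

cov = ar1_correlation(30, 0.5)
X, beta, sigma, rng = example1_setup()
y = draw_response(X, beta, sigma, rng)
```

Calling `draw_response` repeatedly with the same `X` mimics the fixed-design replications; Example 2 would correspond to `r=0.95`.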



Example 2. The same as Example 1, except that $r = 0.95$.

Example 3. $p = 30$ and $\sigma = 1.5$. The predictors in Example 3 are
generated as follows:

$x_i = Z_1 + e_i$, $Z_1 \sim N(0, 1)$, $i = 1, \ldots, 5$;

$x_i = Z_2 + e_i$, $Z_2 \sim N(0, 1)$, $i = 6, \ldots, 10$;

$x_i = Z_3 + e_i$, $Z_3 \sim N(0, 1)$, $i = 11, \ldots, 15$;

$x_i \sim N(0, 1)$, $x_i$ i.i.d., $i = 16, \ldots, 30$,

where $e_i$ are i.i.d. $N(0, 0.01)$, $i = 1, \ldots, 15$. The first 15 components of $\beta$ are
1.5, the remaining ones are zero.
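
The grouped predictors of Example 3 could be generated along the following lines. The sample size of 100, the seed and the function name `example3_covariates` are assumptions made for illustration.

```python
import numpy as np

def example3_covariates(n=100, seed=0):
    rng = np.random.default_rng(seed)                 # the seed is an arbitrary choice
    Z = rng.standard_normal((n, 3))                   # latent factors Z_1, Z_2, Z_3
    noise = rng.normal(0.0, 0.1, size=(n, 15))        # e_i ~ N(0, 0.01), i.e. sd 0.1
    grouped = np.repeat(Z, 5, axis=1) + noise         # columns 1-5 share Z_1, 6-10 share Z_2, ...
    independent = rng.standard_normal((n, 15))        # x_16, ..., x_30
    return np.hstack([grouped, independent])

X3 = example3_covariates()
beta3 = np.concatenate([np.full(15, 1.5), np.zeros(15)])
within_group_corr = np.corrcoef(X3[:, 0], X3[:, 4])[0, 1]
```

Because the shared factor has variance 1 and the added noise has variance 0.01, covariates within a group are almost collinear, which is the grouping effect discussed below.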

Example 4. $p = 200$ and $\sigma = 1.5$. The first 15 covariates $(x_1, \ldots, x_{15})$
and the remaining 185 covariates $(x_{16}, \ldots, x_{200})$ are independent. The
pairwise correlation between the $i$th and the $j$th components of $(x_1, \ldots, x_{15})$
is $r^{|i-j|}$ with $r = 0.5$, $i, j = 1, \ldots, 15$. The pairwise correlation between the
$i$th and the $j$th components of $(x_{16}, \ldots, x_{200})$ is $r^{|i-j|}$ with $r = 0.5$,
$i, j = 16, \ldots, 200$. Components 1-5 of $\beta$ are 2.5, components 6-10 are 1.5,
components 11-15 are 0.5 and the rest are zero. So there are 15 nonzero covariate
effects: five large effects, five moderate effects and five small effects. The
covariate matrix has the partial orthogonal structure.



Example 5. The same as Example 4, except that r = 0.95. 



Example 6. $p = 500$ and $\sigma = 1.5$. The first 15 covariates are generated
the same way as in Example 5. The remaining 485 covariates are independent
of the first 15 covariates and are generated independently from $N(0, 1)$. The
first 15 coefficients equal 1.5, and the remaining 485 coefficients are zero.

The examples with r = 0.5 have weak to moderate correlation among co- 
variates, whereas those with r = 0.95 have moderate to strong correlations 
among covariates. Examples 3 and 6 correspond to the "grouping effects" 
in Zou and Hastie (2005) with three equally important groups. In Exam- 
ples 3 and 6, covariates within the same group are highly correlated and 
the pairwise correlation coefficients are as high as 0.99. Therefore, there is 
particularly strong collinearity among the covariates in these two examples. 

Following the simulation approach of Zou and Hastie (2005), in each example
the simulated data consist of a training set, an independent validation set
and an independent test set, each of size 100. The tuning parameter is
selected using the same simple approach as in Zou and Hastie (2005). We first
fit the model with a given tuning parameter using the training set data only
and compute the mean squared error between the fitted values and the
responses in the validation data. We then search the tuning parameter space
and choose the value with the smallest mean squared error as the final
penalty parameter. Using this penalty parameter and the model estimated
based on the training set, we compute the PMSE for the test set. We also
compute the probabilities that the estimators correctly identify covariates
with nonzero and zero coefficients.
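
The selection loop just described can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: ridge regression is used purely as a stand-in estimator because it has a closed form, the grid of candidate values and the toy data are hypothetical, and PMSE is taken here to be the mean squared prediction error on the test set.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Stand-in estimator (ridge); any penalized fit could be plugged in."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def select_by_validation(fit, grid, train, valid):
    """Fit on the training set for each candidate value and keep the one
    with the smallest mean squared error on the validation set."""
    (X_tr, y_tr), (X_va, y_va) = train, valid
    scores = [np.mean((y_va - X_va @ fit(X_tr, y_tr, lam)) ** 2)
              for lam in grid]
    best = grid[int(np.argmin(scores))]
    return best, fit(X_tr, y_tr, best)

def pmse(b, test):
    """Mean squared prediction error on the test set."""
    X_te, y_te = test
    return np.mean((y_te - X_te @ b) ** 2)

rng = np.random.default_rng(0)
beta = np.r_[np.full(5, 1.5), np.zeros(5)]   # toy coefficient vector

def draw(n=100):
    X = rng.normal(size=(n, 10))
    return X, X @ beta + rng.normal(0.0, 1.5, size=n)

train, valid, test = draw(), draw(), draw()
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
lam_hat, b_hat = select_by_validation(ridge_fit, grid, train, valid)
test_pmse = pmse(b_hat, test)
```

The test set plays no role in choosing `lam_hat`; it is only used once, at the end, to evaluate the selected fit.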

In Examples 1-3, the number of covariates is less than the sample size, 
so we use the bridge approach directly with the algorithm of Section 4.1. In 
Examples 4-6, the number of covariates is greater than the sample size. We 
use the two-step approach described in Section 3. We first select the nonzero 
covariates using the marginal bridge method. The number of nonzero covari- 
ates identified is much less than the sample size. In the second step, we use 
OLS. 
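
The two-step approach can be sketched as below. This is an illustration rather than the authors' code. It assumes standardized covariates and a centered response, and it decides which marginal bridge estimates are nonzero with the threshold rule (16) from the proof of Theorem 3, using the identity $\sum_i Y_i x_{ij} = \varepsilon_n' a_j + n\xi_{nj}$. The function names and the values n = 400, p = 500 and lam = 700 are hypothetical.

```python
import numpy as np

def c_gamma(gamma):
    """Constant c_gamma of Lemma 3 (Knight and Fu, 2000)."""
    return (2 / (2 - gamma)) * (2 * (1 - gamma) / (2 - gamma)) ** (1 - gamma)

def marginal_bridge_select(X, y, lam, gamma=0.5):
    """Indices j whose marginal bridge estimate is nonzero."""
    n = X.shape[0]
    Xs = (X - X.mean(0)) / X.std(0)     # columns: mean 0, sum of squares n
    yc = y - y.mean()
    w_n = (lam / n ** (gamma / 2)) ** (1 / (2 - gamma)) \
        / c_gamma(gamma) ** (1 / (2 - gamma))
    stat = np.abs(Xs.T @ yc) / np.sqrt(n)   # n^{-1/2} |a_j' Y|
    # The estimate is zero exactly when w_n > stat, so keep the rest.
    return np.flatnonzero(stat >= w_n)

def two_step(X, y, lam, gamma=0.5):
    """Step 1: marginal bridge selection. Step 2: OLS on the selected set."""
    keep = marginal_bridge_select(X, y, lam, gamma)
    beta_hat = np.zeros(X.shape[1])
    if keep.size:
        Xk = X[:, keep] - X[:, keep].mean(0)
        beta_hat[keep] = np.linalg.lstsq(Xk, y - y.mean(), rcond=None)[0]
    return keep, beta_hat

rng = np.random.default_rng(0)
n, p = 400, 500                          # more covariates than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = 2.5
y = X @ beta + rng.normal(0.0, 1.5, size=n)
keep, beta_hat = two_step(X, y, lam=700.0)
```

In this toy setting the covariates are mutually independent, so the partial orthogonality condition holds by construction; the rule is not expected to behave this cleanly when zero and nonzero covariates are strongly correlated.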

Summary statistics of the variable selection and PMSE results based on
500 replicates are shown in Table 1. We see that the numbers of nonzero
covariates selected by the bridge estimators are close to the true value (=15)
in all examples. This agrees with the consistent variable selection result of
Theorem 2. On average, the bridge estimator outperforms LASSO and ENet
in terms of variable selection. Table 1 also gives the PMSEs of the Bridge,
RR, LASSO and ENet estimators. For OLS (when applicable), LASSO, ENet
and Bridge, the PMSEs are mainly caused by the variance of the random
error. So the PMSEs are close, in general, with the ENet and Bridge being
better than the LASSO and OLS. The RR is less satisfactory in Examples
4-6, which have 200 or 500 covariates.

Figure 1 shows the frequencies of individual covariate effects being
correctly "classified": zero versus nonzero. For better resolution, we only
plot the first 30 covariates for Examples 4-6. We see that the bridge
estimator can effectively identify large and moderate nonzero covariate
effects and zero covariate effects.

Table 1

Simulation study: comparison of OLS, RR, LASSO, Elastic net and the bridge
estimator with $\gamma = 1/2$. PMSE: median of PMSE, inside "(·)" are the
corresponding standard deviations. Covariate: median of number of covariates
with nonzero coefficients. A hyphen marks a method that is not applicable.

Example              OLS           RR             LASSO         ENet          Bridge
1        PMSE        3.32 (0.58)   3.51 (0.69)    2.92 (0.51)   2.80 (0.47)   2.95 (0.51)
         Covariate   30            30             23            22            17
2        PMSE        3.21 (0.53)   2.65 (0.41)    2.60 (0.40)   2.46 (0.35)   2.37 (0.36)
         Covariate   30            30             18            16            15
3        PMSE        3.26 (0.58)   3.34 (0.58)    2.66 (0.40)   2.38 (0.33)   2.31 (0.34)
         Covariate   30            30             18            15            15
4        PMSE        -             20.45 (2.02)   3.55 (0.64)   3.30 (0.53)   3.98 (0.83)
         Covariate   -             200            37            37            29
5        PMSE        -             5.80 (1.31)    2.71 (0.42)   2.50 (0.36)   2.64 (0.44)
         Covariate   -             200            25            16            15
6        PMSE        -             43.10 (2.23)   3.51 (0.57)   2.70 (0.49)   2.68 (0.39)
         Covariate   -             500            43            20

[Figure 1: six panels, one for each of Examples 1-6.]

FIG. 1. Simulation study (Examples 1-6): probability of individual covariate
effect being correctly identified. Circle: LASSO; Triangle: ENet; Plus sign:
Bridge estimate.

Table 2

Simulation study: comparison of OLS with the first 15 covariates (OLS-oracle),
bridge estimate with the first 15 covariates (bridge-oracle) and bridge
estimate with all covariates. For each model, the first row: median of
absolute bias (across the 15 covariates) and median of variance (across the
15 covariates); the second row: median of EMSE and standard deviation of EMSE

Example              OLS-oracle      Bridge-oracle   Bridge
1        bias/sd     0.007, 0.047    0.019, 0.045    0.035, 0.020
         EMSE        0.647, 0.306    0.625, 0.305    0.702, 0.311
2        bias/sd     0.014, 0.509    0.114, 0.053    0.024, 0.018
         EMSE        7.252, 3.707    0.910, 1.109    0.990, 0.738
3        bias/sd     0.041, 2.041    0.026, 0.080    0.028, 0.007
         EMSE        30.15, 14.01    0.163, 3.468    0.133, 0.898
4        bias/sd     0.006, 0.043    0.014, 0.042    0.061, 0.062
         EMSE        0.655, 0.293    0.662, 0.281    1.186, 0.849
5        bias/sd     0.036, 0.535    0.133, 0.051    0.050, 0.467
         EMSE        7.077, 3.565    1.179, 0.714    7.013, 3.629
6        bias/sd     0.035, 1.928    0.027, 0.078    0.072, 1.923
         EMSE        28.90, 12.46    0.218, 2.967    28.43, 12.65

Simulation studies were also carried out to investigate the asymptotic 
oracle property of the bridge estimator. This property says that bridge esti- 
mators have the same asymptotic efficiency as the estimator obtained under 
the knowledge of which coefficients are nonzero and which are zero. To eval- 
uate this property, we consider three estimators: OLS using the covariates 
with nonzero coefficients only (OLS-oracle); the bridge estimator using the 
covariates with nonzero coefficients (bridge-oracle) ; and the bridge estimator 
using all the covariates. We note that the OLS-oracle and bridge-oracle es- 
timators cannot be used in practice. We use them here only for the purpose 
of comparison. We use the same six examples as described above. 

Table 2 presents the summary statistics based on 500 replications. In
Examples 1-3, the bridge and bridge-oracle estimators perform similarly. In
Examples 4-6, the bridge estimator is similar to the OLS-oracle estimator.
In Examples 2 and 3, where the covariates are highly correlated, the
OLS-oracle estimators have considerably larger EMSEs than the bridge-oracle
and bridge estimators. In Examples 5 and 6, the OLS-oracle estimators and
the two-step estimators have considerably larger EMSEs than the
bridge-oracle estimators. This is due to the fact that OLS estimators tend
to perform poorly when there is strong collinearity among covariates. The
simulation results from these examples also suggest that, in finite samples,
bridge estimators provide substantial improvement over the OLS estimators
in terms of EMSE in the presence of strong collinearity.
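
The effect of collinearity on OLS can be illustrated with a short calculation. This is a rough population-level sketch, not a reproduction of Table 2: it assumes 15 covariates with correlation $r^{|i-j|}$, $n = 100$ and $\sigma = 1.5$, and uses $\sigma^2 \mathrm{tr}(\Sigma^{-1})/n$ as a proxy for the sum of the OLS coefficient variances. The function names are hypothetical.

```python
import numpy as np

def ar1_cov(p, r):
    """Covariance matrix with (i, j) entry r**|i-j|."""
    idx = np.arange(p)
    return r ** np.abs(idx[:, None] - idx[None, :])

def ols_total_variance(r, p=15, n=100, sigma=1.5):
    """Proxy sigma^2 tr(Sigma^{-1}) / n for the summed OLS variances."""
    return sigma ** 2 * np.trace(np.linalg.inv(ar1_cov(p, r))) / n

v_moderate = ols_total_variance(0.5)    # r = 0.5
v_strong = ols_total_variance(0.95)     # r = 0.95
```

Under these assumptions the proxy grows by more than a factor of ten when $r$ moves from 0.5 to 0.95, which is the kind of variance inflation that a shrinkage estimator can trade against a small bias.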

5. Concluding remarks. In this paper we have studied the asymptotic
properties of bridge estimators when the number of covariates and regression
coefficients increases to infinity as $n \to \infty$. We have shown that, when
$0 < \gamma < 1$, bridge estimators correctly identify zero coefficients with
probability converging to one and that the estimators of nonzero coefficients
are asymptotically normal and oracle efficient. Our results generalize the
results of Knight and Fu (2000), who studied the asymptotic behavior of
LASSO-type estimators in the finite-dimensional regression parameter
setting. Theorems 1 and 2 were obtained under the assumption that the number
of parameters is smaller than the sample size, as described in conditions
(A2) and (A3). They are not applicable when the number of parameters
is greater than the sample size, which arises in microarray gene expression
studies. Accordingly, we have also considered a marginal bridge estimator
under a partial orthogonality condition in which the covariates of zero
coefficients are orthogonal to or only weakly correlated with the covariates
of nonzero coefficients. The marginal bridge estimator can consistently
distinguish covariates with zero and nonzero coefficients even when the
number of zero coefficients is greater than the sample size. Indeed, the
number of zero coefficients can be of the order $\exp(o(n))$.

We have proposed a gradient based algorithm for computing bridge
estimators. Our simulation study suggests this algorithm converges reasonably
rapidly. It also suggests that the bridge estimator with $\gamma = 1/2$
behaves well in our simulated models. The bridge estimator correctly
identifies zero coefficients with higher probability than do the LASSO and
Elastic-net estimators. It also performs well in terms of predictive mean
square errors. Our theoretical and numerical results suggest that the bridge
estimator with $0 < \gamma < 1$ is a useful alternative to the existing
methods for variable selection and parameter estimation with
high-dimensional data.

6. Proofs. In this section we give the proofs of the results stated in
Sections 2 and 3. For simplicity of notation and without causing confusion,
we write $X_n$, $X_{1n}$ and $X_{2n}$ as $X$, $X_1$ and $X_2$.

We first prove the following lemma which will be used in the proof of 
Theorem 1. 

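Lemma 1. Let $u$ be a $p_n \times 1$ vector. Under condition (A1)(a),

$$E \sup_{\|u\| < \delta} \Big| \sum_{i=1}^n \varepsilon_i x_i' u \Big| \le \delta \sigma n^{1/2} p_n^{1/2}.$$

Proof. By the Cauchy-Schwarz inequality and condition (A1), we have

$$
\begin{aligned}
E \sup_{\|u\| < \delta} \Big| \sum_{i=1}^n \varepsilon_i x_i' u \Big|^2
&\le E \sup_{\|u\| < \delta} \|u\|^2 \Big\| \sum_{i=1}^n \varepsilon_i x_i \Big\|^2 \\
&\le \delta^2 E \Big[ \Big( \sum_{i=1}^n \varepsilon_i x_i \Big)' \Big( \sum_{i=1}^n \varepsilon_i x_i \Big) \Big] \\
&= \delta^2 \sigma^2 \sum_{i=1}^n x_i' x_i \\
&= \delta^2 \sigma^2 n p_n.
\end{aligned}
$$

Thus, the lemma follows from Jensen's inequality. □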
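
The bound of Lemma 1 can be checked by simulation. The sketch below is illustrative only: it assumes normal errors and independent covariates whose columns are standardized so that each has sum of squares $n$, and the values of $n$, $p_n$, $\sigma$ and $\delta$ are arbitrary. It uses the fact that the supremum over $\|u\| \le \delta$ of $|\sum_i \varepsilon_i x_i'u|$ equals $\delta \|X'\varepsilon\|$, which is the equality case of the Cauchy-Schwarz inequality.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, delta = 200, 5, 1.5, 0.3
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)     # each column has sum of squares n

# sup over ||u|| <= delta of |sum_i eps_i x_i'u| equals delta * ||X' eps||.
sups = np.array([delta * np.linalg.norm(X.T @ rng.normal(0.0, sigma, n))
                 for _ in range(2000)])
mc_mean = sups.mean()                        # Monte Carlo estimate of E sup
bound = delta * sigma * np.sqrt(n * p)       # right-hand side of Lemma 1
```

The Monte Carlo mean falls slightly below the bound; the gap is the slack introduced by Jensen's inequality in the last step of the proof.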



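Proof of Theorem 1. We first show that

$$\|\hat{\beta}_n - \beta_0\| = O_p\big( ((p_n + \lambda_n k_n)/(n \rho_{1n}))^{1/2} \big). \tag{8}$$

By the definition of $\hat{\beta}_n$,

$$\sum_{i=1}^n (Y_i - x_i'\hat{\beta}_n)^2 + \lambda_n \sum_{j=1}^{p_n} |\hat{\beta}_j|^\gamma \le \sum_{i=1}^n (Y_i - x_i'\beta_0)^2 + \lambda_n \sum_{j=1}^{p_n} |\beta_{0j}|^\gamma.$$

It follows that

$$\sum_{i=1}^n (Y_i - x_i'\hat{\beta}_n)^2 \le \sum_{i=1}^n (Y_i - x_i'\beta_0)^2 + \lambda_n \sum_{j=1}^{p_n} |\beta_{0j}|^\gamma.$$

Let $\eta_n = \lambda_n \sum_{j=1}^{p_n} |\beta_{0j}|^\gamma$; then

$$
\begin{aligned}
\eta_n &\ge \sum_{i=1}^n (Y_i - x_i'\hat{\beta}_n)^2 - \sum_{i=1}^n (Y_i - x_i'\beta_0)^2 \\
&= \sum_{i=1}^n [x_i'(\hat{\beta}_n - \beta_0)]^2 + 2 \sum_{i=1}^n \varepsilon_i x_i'(\beta_0 - \hat{\beta}_n).
\end{aligned}
$$

Let $\delta_n = n^{1/2} \Sigma_n^{1/2} (\hat{\beta}_n - \beta_0)$, $D_n = n^{-1/2} \Sigma_n^{-1/2} X'$ and $\varepsilon_n = (\varepsilon_1, \ldots, \varepsilon_n)'$.
Then

$$\sum_{i=1}^n [x_i'(\hat{\beta}_n - \beta_0)]^2 + 2 \sum_{i=1}^n \varepsilon_i x_i'(\beta_0 - \hat{\beta}_n) = \delta_n'\delta_n - 2 (D_n \varepsilon_n)'\delta_n.$$

So we have $\delta_n'\delta_n - 2(D_n\varepsilon_n)'\delta_n - \eta_n \le 0$. That is,
$\|\delta_n - D_n\varepsilon_n\|^2 - \|D_n\varepsilon_n\|^2 - \eta_n \le 0$.
Therefore, $\|\delta_n - D_n\varepsilon_n\| \le \|D_n\varepsilon_n\| + \eta_n^{1/2}$. By the triangle inequality,

$$\|\delta_n\| \le \|\delta_n - D_n\varepsilon_n\| + \|D_n\varepsilon_n\| \le 2\|D_n\varepsilon_n\| + \eta_n^{1/2}.$$

It follows that $\|\delta_n\|^2 \le 6\|D_n\varepsilon_n\|^2 + 3\eta_n$. Let $d_i$ be the $i$th column of $D_n$.
Then $D_n\varepsilon_n = \sum_{i=1}^n d_i\varepsilon_i$. Since $E\varepsilon_i\varepsilon_j = 0$ if $i \ne j$,
$E\|D_n\varepsilon_n\|^2 = \sum_{i=1}^n \|d_i\|^2 E\varepsilon_i^2 = \sigma^2 \mathrm{tr}(D_n D_n') = \sigma^2 p_n$.
So we have $E\|\delta_n\|^2 \le 6\sigma^2 p_n + 3\eta_n$. That is,

$$n E[(\hat{\beta}_n - \beta_0)' \Sigma_n (\hat{\beta}_n - \beta_0)] \le 6\sigma^2 p_n + 3\eta_n. \tag{9}$$

Since the number of nonzero coefficients is $k_n$,
$\eta_n = \lambda_n \sum_{j=1}^{p_n} |\beta_{0j}|^\gamma = O(\lambda_n k_n)$.
Noting that $\rho_{1n}$ is the smallest eigenvalue of $\Sigma_n$, (8) follows from (9).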
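
The two algebraic facts used above, $D_n D_n' = I_{p_n}$ (so that $\mathrm{tr}(D_n D_n') = p_n$) and the rewriting of the quadratic form in terms of $\delta_n$ and $D_n\varepsilon_n$, can be verified numerically. The sketch below uses an arbitrary random design and an arbitrary vector in place of $\hat{\beta}_n - \beta_0$; all names and dimensions are assumptions made for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))
Sigma_n = X.T @ X / n

# Symmetric square root and inverse square root via eigendecomposition.
vals, vecs = np.linalg.eigh(Sigma_n)
Sigma_half = vecs @ np.diag(vals ** 0.5) @ vecs.T
Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T

D = Sigma_inv_half @ X.T / np.sqrt(n)      # D_n, a p x n matrix
trace_DDt = np.trace(D @ D.T)              # should equal p

diff = rng.normal(size=p)                  # stands in for beta_hat - beta_0
eps = rng.normal(size=n)
delta_n = np.sqrt(n) * Sigma_half @ diff

# sum [x_i'(b - b0)]^2 + 2 sum eps_i x_i'(b0 - b)
lhs = np.sum((X @ diff) ** 2) - 2 * eps @ (X @ diff)
# delta_n'delta_n - 2 (D_n eps_n)'delta_n
rhs = delta_n @ delta_n - 2 * (D @ eps) @ delta_n
```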
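We now show that

$$\|\hat{\beta}_n - \beta_0\| = O_p\big( \rho_{1n}^{-1} (p_n/n)^{1/2} \big). \tag{10}$$

Let $r_n = \rho_{1n} (n/p_n)^{1/2}$. The proof of (10) follows that of Theorem 3.2.5 of
Van der Vaart and Wellner (1996). For each $n$, partition the parameter
space (minus $\beta_0$) into the "shells"
$S_{j,n} = \{\beta : 2^{j-1} < r_n \|\beta - \beta_0\| \le 2^j\}$ with
$j$ ranging over the integers. If $r_n \|\hat{\beta}_n - \beta_0\|$ is larger than $2^M$ for a given
integer $M$, then $\hat{\beta}_n$ is in one of the shells with $j \ge M$. By the definition of
$\hat{\beta}_n$ that it minimizes $L_n(\beta)$, for every $\epsilon > 0$,

$$
\begin{aligned}
&P(r_n \|\hat{\beta}_n - \beta_0\| > 2^M) \\
&\quad \le \sum_{j \ge M,\ 2^j \le \epsilon r_n} P\Big( \inf_{\beta \in S_{j,n}} (L_n(\beta) - L_n(\beta_0)) \le 0 \Big) + P(2\|\hat{\beta}_n - \beta_0\| \ge \epsilon).
\end{aligned}
$$

Because $\hat{\beta}_n$ is consistent by (8) and condition (A2), the second term on the
right-hand side converges to zero. So we only need to show that the first
term on the right-hand side converges to zero. Now

$$
\begin{aligned}
L_n(\beta) - L_n(\beta_0)
&= \sum_{i=1}^n (Y_i - x_i'\beta)^2 + \lambda_n \sum_{j=1}^{k_n} |\beta_{1j}|^\gamma + \lambda_n \sum_{j=1}^{m_n} |\beta_{2j}|^\gamma \\
&\qquad - \sum_{i=1}^n (Y_i - w_i'\beta_{10})^2 - \lambda_n \sum_{j=1}^{k_n} |\beta_{01j}|^\gamma \\
&\ge \sum_{i=1}^n (Y_i - x_i'\beta)^2 + \lambda_n \sum_{j=1}^{k_n} |\beta_{1j}|^\gamma - \sum_{i=1}^n (Y_i - w_i'\beta_{10})^2 - \lambda_n \sum_{j=1}^{k_n} |\beta_{01j}|^\gamma \\
&= \sum_{i=1}^n [x_i'(\beta - \beta_0)]^2 - 2 \sum_{i=1}^n \varepsilon_i x_i'(\beta - \beta_0) + \lambda_n \sum_{j=1}^{k_n} \{ |\beta_{1j}|^\gamma - |\beta_{01j}|^\gamma \} \\
&\equiv I_{1n} + I_{2n} + I_{3n}.
\end{aligned}
$$

On $S_{j,n}$, the first term $I_{1n} \ge n \rho_{1n} (2^{2(j-1)}/r_n^2)$. The third term

$$I_{3n} = \lambda_n \gamma \sum_{j=1}^{k_n} |\beta^*_{01j}|^{\gamma - 1} \mathrm{sgn}(\beta_{01j}) (\beta_{1j} - \beta_{01j}),$$

for some $\beta^*_{01j}$ between $\beta_{01j}$ and $\beta_{1j}$. By condition (A4) and since we only
need to consider $\beta$ with $\|\beta - \beta_0\| \le \epsilon$, there exists a constant $c_3 > 0$ such
that

$$|I_{3n}| \le c_3 \gamma \lambda_n \sum_{j=1}^{k_n} |\beta_{1j} - \beta_{01j}| \le c_3 \gamma \lambda_n k_n^{1/2} \|\beta - \beta_0\|.$$

So on $S_{j,n}$, $I_{3n} \ge -c_3 \lambda_n k_n^{1/2} (2^j/r_n)$. Therefore, on $S_{j,n}$,

$$L_n(\beta) - L_n(\beta_0) \ge -|I_{2n}| + n \rho_{1n} (2^{2(j-1)}/r_n^2) - c_3 \lambda_n k_n^{1/2} (2^j/r_n).$$

It follows that

$$
\begin{aligned}
P\Big( \inf_{\beta \in S_{j,n}} (L_n(\beta) - L_n(\beta_0)) \le 0 \Big)
&\le P\Big( \sup_{\beta \in S_{j,n}} |I_{2n}| \ge n \rho_{1n} (2^{2(j-1)}/r_n^2) - c_3 \lambda_n k_n^{1/2} (2^j/r_n) \Big) \\
&\le \frac{2 n^{1/2} p_n^{1/2} (2^j/r_n)}{n \rho_{1n} (2^{2(j-1)}/r_n^2) - c_3 \lambda_n k_n^{1/2} (2^j/r_n)} \\
&= \frac{2}{2^{j-2} - c_3 \lambda_n k_n^{1/2} (n p_n)^{-1/2}},
\end{aligned}
$$

where the second inequality follows from Markov's inequality and Lemma 1.
Under condition (A3)(a), $\lambda_n k_n^{1/2} (n p_n)^{-1/2} \to 0$ as $n \to \infty$. So for $n$
sufficiently large, $2^{j-2} - c_3 \lambda_n k_n^{1/2} (n p_n)^{-1/2} \ge 2^{j-3}$ for all $j \ge 3$. Therefore,

$$\sum_{j \ge M,\ 2^j \le \epsilon r_n} P\Big( \inf_{\beta \in S_{j,n}} (L_n(\beta) - L_n(\beta_0)) \le 0 \Big) \le \sum_{j \ge M} \frac{1}{2^{j-2}} \le 2^{-(M-3)},$$

which converges to zero for every $M = M_n \to \infty$. This completes the proof
of (10). Combining (8) and (10), the result follows. This completes the proof
of Theorem 1. □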

Lemma 2. Suppose that $0 < \gamma < 1$. Let $\hat{\beta}_n = (\hat{\beta}_{1n}', \hat{\beta}_{2n}')'$. Under
conditions (A1) to (A4), $\hat{\beta}_{2n} = 0$ with probability converging to 1.

Proof. By Theorem 1, for a sufficiently large $C$, $\hat{\beta}_n$ lies in the ball
$\{\beta : \|\beta - \beta_0\| \le h_n C\}$ with probability converging to 1, where
$h_n = \rho_{1n}^{-1} (p_n/n)^{1/2}$.
Let $\beta_{1n} = \beta_{10} + h_n u_1$ and $\beta_{2n} = \beta_{20} + h_n u_2 = h_n u_2$ with
$\|u\|^2 = \|u_1\|^2 + \|u_2\|^2 \le C^2$. Let

$$V_n(u_1, u_2) = L_n(\beta_{1n}, \beta_{2n}) - L_n(\beta_{10}, 0) = L_n(\beta_{10} + h_n u_1, h_n u_2) - L_n(\beta_{10}, 0).$$

Then $\hat{\beta}_{1n}$ and $\hat{\beta}_{2n}$ can be obtained by minimizing $V_n(u_1, u_2)$ over $\|u\| \le C$,
except on an event with probability converging to zero. To prove the
lemma, it suffices to show that, for any $u_1$ and $u_2$ with $\|u\| \le C$, if $\|u_2\| > 0$,
$V_n(u_1, u_2) - V_n(u_1, 0) > 0$ with probability converging to 1. Some simple
calculation shows that

$$
\begin{aligned}
V_n(u_1, u_2) - V_n(u_1, 0)
&= h_n^2 \sum_{i=1}^n (z_i'u_2)^2 + 2 h_n^2 \sum_{i=1}^n (w_i'u_1)(z_i'u_2) \\
&\qquad - 2 h_n \sum_{i=1}^n \varepsilon_i (z_i'u_2) + \lambda_n h_n^\gamma \sum_{j=1}^{m_n} |u_{2j}|^\gamma \\
&\equiv II_{1n} + II_{2n} + II_{3n} + II_{4n}.
\end{aligned}
$$

For the first two terms, we have

$$
\begin{aligned}
II_{1n} + II_{2n}
&\ge h_n^2 \sum_{i=1}^n (z_i'u_2)^2 - h_n^2 \sum_{i=1}^n [(w_i'u_1)^2 + (z_i'u_2)^2] \\
&= -h_n^2 \sum_{i=1}^n (w_i'u_1)^2 \\
&\ge -n h_n^2 \tau_{2n} \|u_1\|^2 \\
&\ge -\tau_2 (p_n/\rho_{1n}^2) C^2,
\end{aligned}
\tag{11}
$$

where we used condition (A5)(a) in the last inequality. For the third term,
since

$$
\begin{aligned}
E \Big| \sum_{i=1}^n \varepsilon_i z_i'u_2 \Big|
&\le \sigma \Big[ \sum_{i=1}^n (z_i'u_2)^2 \Big]^{1/2} \\
&\le \sigma n^{1/2} \rho_{2n}^{1/2} \|u_2\| \\
&\le \sigma (n p_n)^{1/2} C,
\end{aligned}
$$

we have

$$II_{3n} = h_n n^{1/2} p_n^{1/2} O_p(1) = (p_n/\rho_{1n}) O_p(1). \tag{12}$$

For the fourth term, we first note that

$$\Big[ \sum_{j=1}^{m_n} |u_{2j}|^\gamma \Big]^{2/\gamma} \ge \sum_{j=1}^{m_n} |u_{2j}|^2 = \|u_2\|^2.$$

Thus,

$$II_{4n} = \lambda_n h_n^\gamma O(\|u_2\|^\gamma). \tag{13}$$

Under condition (A3)(b),
$\lambda_n h_n^\gamma / (p_n \rho_{1n}^{-2}) = \lambda_n n^{-\gamma/2} (\rho_{1n}/\sqrt{p_n})^{2-\gamma} \to \infty$.
Combining (11), (12) and (13), we have, for $\|u_2\|^2 > 0$,
$V_n(u_1, u_2) - V_n(u_1, 0) > 0$ with probability converging to 1. This completes
the proof of Lemma 2. □
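
The elementary inequality used for the fourth term, namely that the sum of $|u_{2j}|^\gamma$ raised to the power $2/\gamma$ is at least the sum of $|u_{2j}|^2$ when $0 < \gamma < 1$, can be spot-checked numerically. The helper name, the dimensions and the values of $\gamma$ below are arbitrary choices made for illustration.

```python
import numpy as np

def gamma_sum_dominates(u, gamma):
    """Check (sum_j |u_j|^gamma)^(2/gamma) >= sum_j u_j^2 up to rounding."""
    lhs = np.sum(np.abs(u) ** gamma) ** (2.0 / gamma)
    rhs = np.sum(u ** 2)
    return lhs >= rhs * (1 - 1e-12)

rng = np.random.default_rng(0)
checks = [gamma_sum_dominates(rng.normal(size=m), g)
          for m in (1, 5, 50) for g in (0.25, 0.5, 0.9)]
```

Equality holds when the vector has a single nonzero entry; otherwise the left-hand side is strictly larger, which is why the penalty term dominates for any nonzero $u_2$.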

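Proof of Theorem 2. Part (i) follows from Lemma 2. We need to
prove (ii). Under conditions (A1) and (A2), $\hat{\beta}_n$ is consistent by
Theorem 1. By condition (A4), each component of $\hat{\beta}_{1n}$ stays away from zero
for $n$ sufficiently large. Thus, it satisfies the stationary equation evaluated
at $(\hat{\beta}_{1n}, \hat{\beta}_{2n})$, $(\partial/\partial\beta_1) L_n(\hat{\beta}_{1n}, \hat{\beta}_{2n}) = 0$. That is,

$$-2 \sum_{i=1}^n (Y_i - w_i'\hat{\beta}_{1n} - z_i'\hat{\beta}_{2n}) w_i + \lambda_n \gamma \psi_n = 0,$$

where $\psi_n$ is a $k_n \times 1$ vector whose $j$th element is
$|\hat{\beta}_{1nj}|^{\gamma - 1} \mathrm{sgn}(\hat{\beta}_{1nj})$. Since $\beta_{20} = 0$ and
$\varepsilon_i = Y_i - w_i'\beta_{10}$, this equation can be written

$$-2 \sum_{i=1}^n (\varepsilon_i - w_i'(\hat{\beta}_{1n} - \beta_{10}) - z_i'\hat{\beta}_{2n}) w_i + \lambda_n \gamma \psi_n = 0.$$

Therefore,

$$\Sigma_{1n} (\hat{\beta}_{1n} - \beta_{10}) = \frac{1}{n} \sum_{i=1}^n \varepsilon_i w_i - \frac{\gamma \lambda_n}{2n} \psi_n - \frac{1}{n} \sum_{i=1}^n z_i'\hat{\beta}_{2n} w_i.$$

It follows that

$$
\begin{aligned}
n^{1/2} \alpha_n'(\hat{\beta}_{1n} - \beta_{10})
&= n^{-1/2} \sum_{i=1}^n \varepsilon_i \alpha_n' \Sigma_{1n}^{-1} w_i - \frac{1}{2} \gamma n^{-1/2} \lambda_n \alpha_n' \Sigma_{1n}^{-1} \psi_n \\
&\qquad - n^{-1/2} \sum_{i=1}^n z_i'\hat{\beta}_{2n} \alpha_n' \Sigma_{1n}^{-1} w_i.
\end{aligned}
$$

By (i), $P(\hat{\beta}_{2n} = 0) \to 1$. Thus, the last term on the right-hand side equals
zero with probability converging to 1. It certainly follows that it converges
to zero in probability. When $\|\alpha_n\| \le 1$, under condition (A4),

$$|n^{-1/2} \alpha_n' \Sigma_{1n}^{-1} \psi_n| \le n^{-1/2} \tau_1^{-1} \|\alpha_n\| \cdot \|\psi_n\| \le 2 n^{-1/2} \tau_1^{-1} k_n^{1/2} b_0^{-(1-\gamma)},$$

except on an event with probability converging to zero. Under (A3)(a),
$\lambda_n (k_n/n)^{1/2} \to 0$. Therefore,

$$n^{1/2} s_n^{-1} \alpha_n'(\hat{\beta}_{1n} - \beta_{10}) = n^{-1/2} s_n^{-1} \sum_{i=1}^n \varepsilon_i \alpha_n' \Sigma_{1n}^{-1} w_i + o_p(1). \tag{14}$$

We verify the conditions of the Lindeberg-Feller central limit theorem.
Let $v_i = n^{-1/2} s_n^{-1} \alpha_n' \Sigma_{1n}^{-1} w_i$ and $W_i = \varepsilon_i v_i$. First,

$$\mathrm{Var}\Big( \sum_{i=1}^n W_i \Big) = n^{-1} \sigma^2 s_n^{-2} \sum_{i=1}^n \alpha_n' \Sigma_{1n}^{-1} w_i w_i' \Sigma_{1n}^{-1} \alpha_n = s_n^{-2} s_n^2 = 1.$$

For any $\epsilon > 0$,
$\sum_{i=1}^n E[W_i^2 1\{|W_i| > \epsilon\}] = \sum_{i=1}^n v_i^2 E[\varepsilon_i^2 1\{|\varepsilon_i v_i| > \epsilon\}]$. Since

$$\sum_{i=1}^n v_i^2 = n^{-1} s_n^{-2} \sum_{i=1}^n \alpha_n' \Sigma_{1n}^{-1} w_i w_i' \Sigma_{1n}^{-1} \alpha_n = \sigma^{-2},$$

it suffices to show that $\max_{1 \le i \le n} E[\varepsilon_i^2 1\{|\varepsilon_i v_i| > \epsilon\}] \to 0$, or equivalently,

$$\max_{1 \le i \le n} |v_i| = n^{-1/2} s_n^{-1} \max_{1 \le i \le n} |\alpha_n' \Sigma_{1n}^{-1} w_i| \to 0. \tag{15}$$

Since $|\alpha_n' \Sigma_{1n}^{-1} w_i| \le (\alpha_n' \Sigma_{1n}^{-1} \alpha_n)^{1/2} (w_i' \Sigma_{1n}^{-1} w_i)^{1/2}$ and
$s_n^{-1} = \sigma^{-1} (\alpha_n' \Sigma_{1n}^{-1} \alpha_n)^{-1/2}$, we have

$$\max_{1 \le i \le n} |v_i| \le \sigma^{-1} n^{-1/2} \max_{1 \le i \le n} (w_i' \Sigma_{1n}^{-1} w_i)^{1/2} \le \sigma^{-1} \tau_1^{-1/2} n^{-1/2} \max_{1 \le i \le n} (w_i'w_i)^{1/2},$$

(15) follows from assumption (A5). This completes the proof of Theorem 2. □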
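
The normalization used in the Lindeberg-Feller argument, $\sigma^2 \sum_i v_i^2 = 1$, is purely algebraic and can be checked on any design. The sketch below uses an arbitrary random matrix whose rows play the role of $w_i'$ and an arbitrary unit vector $\alpha_n$; the dimensions and variable names are assumptions for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma = 200, 4, 1.5
W = rng.normal(size=(n, k))                 # rows play the role of w_i'
Sigma1 = W.T @ W / n                        # Sigma_{1n}
alpha = rng.normal(size=k)
alpha /= np.linalg.norm(alpha)              # ||alpha_n|| = 1

Sigma1_inv_alpha = np.linalg.solve(Sigma1, alpha)
s_n = sigma * np.sqrt(alpha @ Sigma1_inv_alpha)

# v_i = n^{-1/2} s_n^{-1} alpha_n' Sigma_{1n}^{-1} w_i
v = (W @ Sigma1_inv_alpha) / (np.sqrt(n) * s_n)

variance_of_sum = sigma ** 2 * np.sum(v ** 2)   # Var(sum_i eps_i v_i)
max_abs_v = np.max(np.abs(v))                   # quantity controlled in (15)
```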



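Lemma 3 [Knight and Fu (2000)]. Let $g(u) = u^2 - 2au + \lambda|u|^\gamma$, where
$a \ne 0$, $\lambda \ge 0$, and $0 < \gamma < 1$. Denote

$$c_\gamma = \Big( \frac{2}{2 - \gamma} \Big) \Big( \frac{2(1 - \gamma)}{2 - \gamma} \Big)^{1 - \gamma}.$$

Suppose that $a \ne 0$. Then $\arg\min(g) = 0$ if and only if $\lambda > c_\gamma |a|^{2-\gamma}$.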
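
The threshold in Lemma 3 can be checked numerically by minimizing $g$ over a fine grid. This is an illustrative sketch: the grid search, the search interval and the choices $a = 1$ and $\gamma = 1/2$ are assumptions, and the grid minimizer is only an approximation to the exact minimizer when the latter is nonzero.

```python
import numpy as np

def c_gamma(gamma):
    """Constant c_gamma from Lemma 3."""
    return (2 / (2 - gamma)) * (2 * (1 - gamma) / (2 - gamma)) ** (1 - gamma)

def argmin_g(a, lam, gamma, half_points=100001):
    """Grid minimizer of g(u) = u^2 - 2au + lam |u|^gamma on [-2|a|, 2|a|]."""
    half = np.linspace(0.0, 2 * abs(a), half_points)
    u = np.concatenate([-half[:0:-1], half])   # symmetric grid containing 0.0
    g = u ** 2 - 2 * a * u + lam * np.abs(u) ** gamma
    return u[np.argmin(g)]

a, gamma = 1.0, 0.5
threshold = c_gamma(gamma) * abs(a) ** (2 - gamma)

above = argmin_g(a, 1.05 * threshold, gamma)   # lam just above the threshold
below = argmin_g(a, 0.95 * threshold, gamma)   # lam just below the threshold
```

With the penalty level slightly above the threshold the minimizer is zero, and slightly below it the minimizer jumps to a value well away from zero. This discontinuity is the thresholding behavior that the marginal bridge estimator exploits in the proof of Theorem 3.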

Let $\psi_2(x) = \exp(x^2) - 1$. For any random variable $X$, its $\psi_2$-Orlicz norm
$\|X\|_{\psi_2}$ is defined as $\|X\|_{\psi_2} = \inf\{C > 0 : E\psi_2(|X|/C) \le 1\}$. The Orlicz norm
is useful for obtaining maximal inequalities; see Van der Vaart and Wellner
(1996), Section 2.2.

Lemma 4. Let $c_1, \ldots, c_n$ be constants satisfying $\sum_{i=1}^n c_i^2 = 1$, and let
$W = \sum_{i=1}^n c_i \varepsilon_i$.

(i) Under condition (B1), $\|W\|_{\psi_2} \le K_2 [\sigma + ((1 + K) C^{-1})^{1/2}]$, where $K_2$
is a constant.

(ii) Let $W_1, \ldots, W_m$ be random variables with the same distribution as
$W$. For any $w_n > 0$,

$$P\Big( w_n > \max_{1 \le j \le m} |W_j| \Big) \ge 1 - \frac{(\log 2)^{1/2} K (\log m)^{1/2}}{w_n}$$

for a constant $K$ not depending on $n$.

Proof. (i) Without loss of generality, assume $c_i \ne 0$, $i = 1, \ldots, n$. First,
because $\varepsilon_i$ is sub-Gaussian, its Orlicz norm
$\|\varepsilon_i\|_{\psi_2} \le [(1 + K)/C]^{1/2}$ [Lemma 2.2.1,
Van der Vaart and Wellner (1996)]. By Proposition A.1.6 of Van der Vaart
and Wellner (1996), there exists a constant $K_2$ such that

$$
\begin{aligned}
\Big\| \sum_{i=1}^n c_i \varepsilon_i \Big\|_{\psi_2}
&\le K_2 \Big\{ E \Big| \sum_{i=1}^n c_i \varepsilon_i \Big| + \Big[ \sum_{i=1}^n \|c_i \varepsilon_i\|_{\psi_2}^2 \Big]^{1/2} \Big\} \\
&\le K_2 \Big\{ \sigma + \Big[ (1 + K) C^{-1} \sum_{i=1}^n c_i^2 \Big]^{1/2} \Big\} \\
&= K_2 [\sigma + ((1 + K) C^{-1})^{1/2}].
\end{aligned}
$$

(ii) By Lemma 2.2.2 of Van der Vaart and Wellner (1996),
$\|\max_{1 \le j \le m} W_j\|_{\psi_2} \le K (\log m)^{1/2}$ for a constant $K$. Because
$E|W| \le (\log 2)^{1/2} \|W\|_{\psi_2}$ for any random variable $W$, we have

$$E\Big( \max_{1 \le j \le m} |W_j| \Big) \le (\log 2)^{1/2} K (\log m)^{1/2}$$

for a constant $K$. By the Markov inequality, we have

$$P\Big( w_n > \max_{1 \le j \le m} |W_j| \Big) = 1 - P\Big( \max_{1 \le j \le m} |W_j| \ge w_n \Big) \ge 1 - \frac{(\log 2)^{1/2} K (\log m)^{1/2}}{w_n}.$$

This completes the proof. □
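
The $(\log m)^{1/2}$ growth of the expected maximum in Lemma 4(ii) can be illustrated by simulation. The sketch below makes assumptions that the lemma does not need: the $W_j$ are taken to be independent standard normal variables, and the comparison bound $\sqrt{2 \log(2m)}$ is the standard one for Gaussian maxima rather than the constant $K$ of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
ms = [10, 100, 1000]

# Monte Carlo estimate of E max_{j <= m} |W_j| for independent N(0, 1) W_j.
mean_max = [np.abs(rng.normal(size=(2000, m))).max(axis=1).mean() for m in ms]

# Ratio to (log m)^{1/2}; it stays bounded as m grows.
ratios = [mm / np.sqrt(np.log(m)) for mm, m in zip(mean_max, ms)]

# Standard Gaussian bound E max |W_j| <= sqrt(2 log(2m)).
gauss_bound = [np.sqrt(2 * np.log(2 * m)) for m in ms]
```

The expected maximum grows slowly with $m$, which is why the number of zero coefficients is allowed to be very large relative to $n$ in Theorem 3.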



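Proof of Theorem 3. Recall $\xi_{nj} = n^{-1} \sum_{i=1}^n (w_i'\beta_{10}) x_{ij}$ defined in
(5). Let $a_j = (x_{1j}, \ldots, x_{nj})'$. Write

$$
\begin{aligned}
U_n(\beta) &= \sum_{j=1}^{p_n} \sum_{i=1}^n (Y_i - x_{ij}\beta_j)^2 + \lambda_n \sum_{j=1}^{p_n} |\beta_j|^\gamma \\
&= \sum_{j=1}^{p_n} \Big[ \sum_{i=1}^n Y_i^2 + n\beta_j^2 - 2(\varepsilon_n'a_j + n\xi_{nj})\beta_j + \lambda_n |\beta_j|^\gamma \Big].
\end{aligned}
$$

So minimizing $U_n$ is equivalent to minimizing
$\sum_{j=1}^{p_n} [n\beta_j^2 - 2(\varepsilon_n'a_j + n\xi_{nj})\beta_j + \lambda_n |\beta_j|^\gamma]$. Let

$$g_j(\beta_j) \equiv n\beta_j^2 - 2(\varepsilon_n'a_j + n\xi_{nj})\beta_j + \lambda_n |\beta_j|^\gamma, \qquad j = 1, \ldots, p_n.$$

By Lemma 3, $\beta_j = 0$ is the only solution to $g_j(\beta_j) = 0$ if and only if

$$n^{-1} \lambda_n > c_\gamma (n^{-1} |\varepsilon_n'a_j + n\xi_{nj}|)^{2-\gamma}.$$

Let $w_n = c_\gamma^{-1/(2-\gamma)} (\lambda_n/n^{\gamma/2})^{1/(2-\gamma)}$. This inequality can be written

$$w_n > n^{-1/2} |\varepsilon_n'a_j + n\xi_{nj}|. \tag{16}$$

To prove the theorem, it suffices to show that

$$P\Big( w_n > n^{-1/2} \max_{j \in J_n} |\varepsilon_n'a_j + n\xi_{nj}| \Big) \to 1 \tag{17}$$

and

$$P\Big( w_n > n^{-1/2} \min_{j \in K_n} |\varepsilon_n'a_j + n\xi_{nj}| \Big) \to 0. \tag{18}$$

We first prove (17). By condition (B2)(a), there exists a constant $c_0 > 0$
such that

$$\Big| n^{-1/2} \sum_{i=1}^n x_{ij} x_{ik} \Big| \le c_0, \qquad j \in J_n,\ k \in K_n,$$

for all $n$ sufficiently large. Therefore,

$$n^{1/2} |\xi_{nj}| = n^{-1/2} \Big| \sum_{k=1}^{k_n} \sum_{i=1}^n x_{ik} x_{ij} \beta_{01k} \Big| \le b_1 \sum_{k=1}^{k_n} \Big| n^{-1/2} \sum_{i=1}^n x_{ik} x_{ij} \Big| \le b_1 c_0 k_n, \qquad j \in J_n, \tag{19}$$

where $b_1$ is given in condition (B4). Let $c_1 = b_1 c_0$. By (16) and (19), we have

$$
\begin{aligned}
P\Big( w_n > n^{-1/2} \max_{j \in J_n} |\varepsilon_n'a_j + n\xi_{nj}| \Big)
&\ge P\Big( w_n > n^{-1/2} \max_{j \in J_n} |\varepsilon_n'a_j| + n^{1/2} \max_{j \in J_n} |\xi_{nj}| \Big) \\
&\ge P\Big( w_n > n^{-1/2} \max_{j \in J_n} |\varepsilon_n'a_j| + c_1 k_n \Big) \\
&= 1 - P\Big( n^{-1/2} \max_{j \in J_n} |\varepsilon_n'a_j| \ge w_n - c_1 k_n \Big) \\
&\ge 1 - \frac{E( n^{-1/2} \max_{j \in J_n} |\varepsilon_n'a_j| )}{w_n - c_1 k_n}.
\end{aligned}
\tag{20}
$$

By Lemma 4(i), $n^{-1/2} \varepsilon_n'a_j$ is sub-Gaussian, $1 \le j \le m_n$. By condition (B3)(a),

$$\frac{k_n}{w_n} = c_\gamma^{1/(2-\gamma)} \Big( \frac{k_n^{2-\gamma}}{\lambda_n n^{-\gamma/2}} \Big)^{1/(2-\gamma)} \to 0. \tag{21}$$

Thus, by Lemma 4(ii), combining (20) and (21), and by condition (B3)(b),

$$P\Big( w_n > n^{-1/2} \max_{j \in J_n} |\varepsilon_n'a_j + n\xi_{nj}| \Big) \ge 1 - \frac{(\log 2)^{1/2} K (\log m_n)^{1/2}}{w_n - c_1 k_n} \to 1.$$

This proves (17). We now prove (18). We have

$$
\begin{aligned}
P\Big( w_n > \min_{j \in K_n} |n^{-1/2} \varepsilon_n'a_j + n^{1/2} \xi_{nj}| \Big)
&= P\Big( \bigcup_{j \in K_n} \{ |n^{-1/2} \varepsilon_n'a_j + n^{1/2} \xi_{nj}| < w_n \} \Big) \\
&\le \sum_{j \in K_n} P( |n^{-1/2} \varepsilon_n'a_j + n^{1/2} \xi_{nj}| < w_n ).
\end{aligned}
\tag{22}
$$

Write

$$P( |n^{-1/2} \varepsilon_n'a_j + n^{1/2} \xi_{nj}| < w_n ) = 1 - P( |n^{-1/2} \varepsilon_n'a_j + n^{1/2} \xi_{nj}| \ge w_n ). \tag{23}$$

By condition (B2)(b), $\min_{j \in K_n} |\xi_{nj}| \ge \xi_0 > 0$ for all $n$ sufficiently large. By
Lemma 4, $n^{-1/2} \varepsilon_n'a_j$ are sub-Gaussian. We have

$$
\begin{aligned}
P( |n^{-1/2} \varepsilon_n'a_j + n^{1/2} \xi_{nj}| \ge w_n )
&\ge P( n^{1/2} |\xi_{nj}| - n^{-1/2} |\varepsilon_n'a_j| \ge w_n ) \\
&= 1 - P( n^{-1/2} |\varepsilon_n'a_j| > n^{1/2} |\xi_{nj}| - w_n ) \\
&\ge 1 - K \exp[-C (n^{1/2} \xi_0 - w_n)^2].
\end{aligned}
\tag{24}
$$

By (22), (23) and (24), we have

$$P\Big( w_n > \min_{j \in K_n} |n^{-1/2} \varepsilon_n'a_j + n^{1/2} \xi_{nj}| \Big) \le k_n K \exp[-C (n^{1/2} \xi_0 - w_n)^2].$$

By condition (B3)(a), we have

$$\frac{w_n}{n^{1/2}} = O(1) \Big( \frac{\lambda_n}{n^{\gamma/2}} \Big)^{1/(2-\gamma)} n^{-1/2} = O(1) (\lambda_n/n)^{1/(2-\gamma)} = o(1).$$

Therefore,

$$P\Big( w_n > \min_{j \in K_n} |n^{-1/2} \varepsilon_n'a_j + n^{1/2} \xi_{nj}| \Big) = O(1) k_n \exp(-Cn) = o(1),$$

where the last equality follows from condition (B3)(a). Thus, (18) follows.
This completes the proof of Theorem 3. □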

Proof of Theorem 4. By Theorem 3, conditions (B1) to (B4) ensure that the
marginal bridge estimator correctly selects covariates with nonzero and zero
coefficients with probability converging to one. Therefore, for asymptotic
analysis, the second step estimator $\hat{\beta}_n$ can be defined as the value
that minimizes $U_n^*$ defined by (6). We now can prove Theorem 4 in two
steps. First, under conditions (B1)(a) and (B6), consistency of
$\hat{\beta}_{1n}$ follows from the same argument as in the proof of Theorem 1.
Then under conditions (B1)(a), (B5) and (B6), asymptotic normality can be
proved the same way as in the proof of Theorem 2. This completes the proof
of Theorem 4. □